This paper proposes consistent estimators for transformation parameters
in semiparametric models. The problem is to find the optimal
transformation into the space of models with a predetermined regression
structure like additive or multiplicative separability. We give results
for the estimation of the transformation when the rest of the model is
estimated non- or semi-parametrically and fulfills some consistency
conditions. We propose two methods for the estimation of the
transformation parameter: maximizing a profile likelihood function or
minimizing the mean squared distance from independence. First the problem
of identification of such models is discussed. We then state asymptotic
results for a general class of nonparametric estimators. Finally, we give
some particular examples of nonparametric estimators of transformed
separable models. The small sample performance is studied in several
simulations.

1. Introduction. Taking transformations of the data has been an inte- 
gral part of statistical practice for many years. Transformations have been 
used to aid interpretability as well as to improve statistical performance. 
An important contribution to this methodology was made by Box and Cox 
(1964) who proposed a parametric power family of transformations that 
nested the logarithm and the level. They suggested that the power trans- 
formation, when applied to the dependent variable in a linear regression 
setting, might induce normality, error variance homogeneity and additivity 
of effects. They proposed estimation methods for the regression and
transformation parameters. Carroll and Ruppert (1984) applied this and other
transformations to both dependent and independent variables. A number of 
other dependent variable transformations have been suggested, for example, 
the Zellner-Revankar (1969) transform and the Bickel and Doksum (1981) 
transform. The transformation methodology has been quite successful and a 
large literature exists on this subject for parametric models; see Carroll and 
Ruppert (1988). In survival analysis there are many applications due to the 
interpretation of versions of the model as accelerated failure time models, 
proportional hazard models, mixed proportional hazard models and propor- 
tional odds models; see, for example, Doksum (1987), Wei (1992), Cheng 
and Wu (1994), Cheng, Wei and Ying (1995) and van den Berg (2001). 

In this work we concentrate on transformations in a regression setting. 
For many data, linearity of covariate effect after transformation may be too 
strong. We consider a rather general specification, allowing for
nonparametric covariate effects. Let $X$ be a $d$-dimensional random
vector and $Y$ be a random variable, and let $\{(X_i, Y_i)\}_{i=1}^n$ be
an i.i.d. sample from this population. Consider the estimation of the
regression function $m(x) = E(Y \mid X = x)$. Stone (1980, 1982) and
Ibragimov and Hasminskii (1980) showed that the optimal rate for
estimating $m$ is $n^{-\ell/(2\ell+d)}$, with $\ell$ a measure of the
smoothness of $m$. This rate of convergence can be very slow for large
dimensions $d$. One way of achieving better rates of convergence is
making use of dimension reducing separability structures. The most common
examples are additive or multiplicative modeling. An additive structure
for $m$, for example, is a regression function of the form
$m(x) = \sum_{\alpha=1}^d m_\alpha(x_\alpha)$, where
$x = (x_1, \ldots, x_d)$ are the $d$-dimensional predictor variables and
$m_\alpha$ are one-dimensional nonparametric functions. Stone (1986)
showed that for such regression curves the optimal rate for estimating $m$
is the one-dimensional rate of convergence $n^{-\ell/(2\ell+1)}$. Thus, one
speaks of dimensionality reduction through additive modeling.
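
To make the dimensionality reduction concrete, the following minimal sketch compares the rate exponents with and without additivity. The function names are hypothetical, and constants in the rates are ignored, so the sample sizes printed are only indicative of orders of magnitude.

```python
def optimal_rate_exponent(smoothness: float, dim: int) -> float:
    """Exponent r in the rate n**(-r), with r = l / (2l + d)."""
    return smoothness / (2.0 * smoothness + dim)


def sample_size_for_error(target: float, smoothness: float, dim: int) -> float:
    """Sample size n solving n**(-r) = target (all constants ignored)."""
    r = optimal_rate_exponent(smoothness, dim)
    return target ** (-1.0 / r)


if __name__ == "__main__":
    # An additive model keeps the d = 1 exponent whatever the number of covariates.
    for d in (1, 2, 5, 10):
        r = optimal_rate_exponent(2.0, d)
        n = sample_size_for_error(0.1, 2.0, d)
        print(f"d={d:2d}  exponent={r:.3f}  n for error 0.1 ~ {n:,.0f}")
```

With smoothness 2, the exponent drops from 0.4 at d = 1 to about 0.14 at d = 10, which is the slowdown that the additive structure avoids.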

We examine a semiparametric model that combines a parametric transfor- 
mation with the flexibility of an additive nonparametric regression function. 
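Suppose that

(1)    $\Lambda(Y) = G(m_1(X_1), \ldots, m_d(X_d)) + \varepsilon,$

where $\varepsilon$ is independent of $X$, while $G$ is a known function
and $\Lambda$ is a monotonic function. The general model in which
$\Lambda$ is monotonic and $G(z) = \sum_{\alpha=1}^d z_\alpha$ was
previously addressed in Breiman and Friedman (1985), who suggested
estimation procedures based on the iterative backfitting method, which
they called ACE.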
However, they did not provide many results about the statistical
properties of their procedures. Linton, Chen, Wang and Hardle (1997)
considered the model with $\Lambda = \Lambda_\theta$ parametric and
additive $G$, $G(z) = \sum_{\alpha=1}^d z_\alpha$. They proposed to
estimate the parameters of the transformation $\Lambda$ by either an
instrumental variable method or a pseudo-likelihood method based on
Gaussian $\varepsilon$. For the instrumental variable method, they assumed
that identification held from some unconditional moment restriction but
they did not provide justification for this from primitive conditions.
Unfortunately, our simulation evidence suggests that both methods work
poorly in practice and may even be inconsistent for many parameter
configurations. To estimate the unknown functions $m_\alpha$ they used the
marginal integration method of Linton and Nielsen (1995) and,
consequently, their method cannot achieve the semiparametric efficiency
bound for estimation of $\theta$ even in the few cases where Gaussian
errors are well defined and their method is consistent.

We argue that an even more general version of the model (1) is identified 
following results of Ekeland, Heckman and Nesheim (2004). For practical 
reasons, we propose estimation procedures only for the parametric
transformation case where $\Lambda(y) = \Lambda_{\theta_0}(y)$ for some
parametric family $\{\Lambda_\theta(\cdot), \theta \in \Theta\}$ of
transformations, where $\Theta \subset \mathbb{R}^k$. This model includes, for example, the
Nielsen, Linton and Bickel (1998) (reversed) proportional hazard model 
where the baseline hazard is parametric and the covariate effect is non- 
parametric. This is appropriate for certain mortality studies where there are 
well established models for baseline mortality but covariate effects are not 
so well understood. To estimate the transformation parameters, we use two 
approaches. First, a semiparametric profile likelihood estimator (PL) that 
involves nonparametric estimation of the density of $\varepsilon$, and second, a mean
squared distance from independence method (MD) based on estimated
c.d.f.'s of $(X, \varepsilon)$. Both methods use a profiled estimate of the (separable)
nonparametric components of $m_\theta$. We use both the integration method and
the smooth backfitting method of Mammen, Linton and Nielsen (1999) to 
estimate these components. The MD estimator involves discontinuous func- 
tions of nonparametric estimators and we use the theory of Chen, Linton 
and Van Keilegom (2003) to obtain its asymptotic properties. We derive the 
asymptotic distributions of our estimators under standard regularity
conditions, and we show that the estimators of $\theta_0$ are root-n
consistent. The corresponding estimators of the component functions
$m_j(\cdot)$ behave as if the parameters $\theta_0$ were known and are also
asymptotically normal at nonparametric rates.

The rest of the paper is organized as follows. In the next section we clarify 
identification issues. In Section 3 we introduce the two estimators for the 
transformation parameter. Section 4 contains the asymptotic theory of these 
two estimators. Additionally, we discuss tools like bootstrap for possible 
inference on the transformation parameter. Finally, in Section 5 we study 
the finite sample performance of all methods presented and compare the 
different estimators of the transformation parameter, as well as the different 
estimators of the additive components in this context. A special emphasis
is also given to the question of bandwidth choice. All proofs are deferred to 
Appendix A and Appendix B. 

2. Nonparametric identification. Suppose that

(2)    $\Lambda(Y) = m(X) + \varepsilon,$

where $\varepsilon$ is independent of $X$ with unknown distribution
$F_\varepsilon$, and the functions $\Lambda$ and $m$ are unknown. Then

(3)    $F_{Y|X}(y, x) = \Pr[Y \le y \mid X = x] = F_\varepsilon(\Lambda(y) - m(x)).$

Recently Ekeland, Heckman and Nesheim (2004), building on ideas of Horowitz
(1996, 2001), have shown that this model is identifiable up to a couple of
normalizations under smoothness conditions on $(F_\varepsilon, \Lambda, m)$
and monotonicity conditions on $\Lambda$ and $F_\varepsilon$. The basic
idea is to note that, for each $j$,

(4)    $\frac{\partial F_{Y|X}(y,x)/\partial y}{\partial F_{Y|X}(y,x)/\partial x_j} = -\frac{\lambda(y)}{\partial m(x)/\partial x_j},$

where $\lambda(y) = \partial\Lambda(y)/\partial y$. Then by integrating out
either $y$ or $x$, one obtains $\lambda(\cdot)$ up to a constant or
$\partial m(\cdot)/\partial x_j$ up to a constant. By further integrations,
one obtains $\Lambda(\cdot)$ and $m(\cdot)$ up to a constant. One then
obtains $F_\varepsilon$ by inverting the relationship (3) and imposing the
normalizations. Horowitz (1996) indeed covers the special case where $m(x)$
is linear.

The above arguments show that for identification it is not necessary to
restrict $\Lambda$, $m$ or $F_\varepsilon$ beyond monotonicity, smoothness
and normalization restrictions. However, the implied estimation strategy
can be very complicated; see, for example, Lewbel and Linton (2006). In
addition, the fully nonparametric model does not at all reduce the curse of
dimensionality in comparison with the unrestricted conditional distribution
$F_{Y|X}(y, x)$, which makes the practical relevance of the identification
result limited. This is why we consider additive and multiplicative
structures on $m$ and a parametric restriction on $\Lambda$. The
unrestricted model could be used for testing these assumptions, although we
do not pursue this in this paper.

To conclude this section, we briefly discuss some related work on the
identification of related models. Linton, Chen, Wang and Hardle (1997)
assumed identification of the model (2) with parametric $\Lambda$ and
additive $m$ based on an unconditional moment restriction on the error term
rather than full independence. In particular, they assumed that
$E[Z\varepsilon] = 0$ for a vector of variables $Z$. This does not seem to
be sufficient to justify identification and, indeed, our simulation
evidence supports this concern. Finally, we mention a nonparametric
identification result of Breiman and Friedman (1985). They defined
functions $\Lambda(\cdot), m_1(\cdot), \ldots, m_d(\cdot)$ as minimizers of
the least squares objective function

(5)    $e^2(\Lambda, m_1, \ldots, m_d) = \frac{E[\{\Lambda(Y) - \sum_{\alpha=1}^d m_\alpha(X_\alpha)\}^2]}{E[\Lambda^2(Y)]}$

for general random variables $Y, X_1, \ldots, X_d$. They showed the
existence of minimizers of (5) and showed that the set of minimizers forms
a finite dimensional linear subspace (of an appropriate class of functions)
under additional conditions. These conditions were that:
(i) $\Lambda(Y) - \sum_{\alpha=1}^d m_\alpha(X_\alpha) = 0$ a.s. implies
that $\Lambda(Y), m_\alpha(X_\alpha) = 0$ a.s., $\alpha = 1, \ldots, d$;
(ii) $E[\Lambda(Y)] = 0$, $E[m_\alpha(X_\alpha)] = 0$,
$E[\Lambda^2(Y)] < \infty$, and $E[m_\alpha^2(X_\alpha)] < \infty$;
(iii) the conditional expectation operators $E[\Lambda(Y) \mid X_\alpha]$,
$E[m_\alpha(X_\alpha) \mid Y]$, $\alpha = 1, \ldots, d$, are compact. This
result does not require any model assumptions like conditional moments or
independent errors, but has more limited scope. We shall maintain the model
assumption of independent errors in the sequel.
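
As a worked step for the identification argument in this section, the ratio identity behind it can be derived by differentiating (3). The sketch below assumes, beyond what is stated explicitly in the text, that $F_\varepsilon$ has a strictly positive density $f_\varepsilon$ and that $\partial m(x)/\partial x_j \neq 0$ at the point considered.

```latex
\begin{align*}
\frac{\partial F_{Y|X}(y,x)}{\partial y}
  &= f_\varepsilon\bigl(\Lambda(y) - m(x)\bigr)\,\lambda(y), \\
\frac{\partial F_{Y|X}(y,x)}{\partial x_j}
  &= -f_\varepsilon\bigl(\Lambda(y) - m(x)\bigr)\,
     \frac{\partial m(x)}{\partial x_j}, \\
\frac{\partial F_{Y|X}(y,x)/\partial y}{\partial F_{Y|X}(y,x)/\partial x_j}
  &= -\frac{\lambda(y)}{\partial m(x)/\partial x_j}.
\end{align*}
```

The unknown density cancels in the ratio, so the left-hand side, which depends only on the observable conditional distribution, separates into a function of $y$ alone and a function of $x$ alone.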



3. Estimating the transformation. In the sequel we consider the model

(6)    $\Lambda_{\theta_0}(Y) = m(X) + \varepsilon,$

where $\{\Lambda_\theta : \theta \in \Theta\}$ is a parametric family of
strictly increasing functions, while the function $m(\cdot)$ is of unknown
form but with a certain predetermined structure that is sufficient to yield
dimensionality reduction. We assume that the error term $\varepsilon$ is
independent of $X$, has distribution $F_\varepsilon$, and
$E(\varepsilon) = 0$. The covariate $X$ is $d$-dimensional and has compact
support $\mathcal{X} = \prod_{\alpha=1}^d R_{X_\alpha}$. Among the many
transformations of interest, the following ones are used most commonly:
(Box-Cox) $\Lambda_\theta(y) = (y^\theta - 1)/\theta$ $(\theta \neq 0)$ and
$\Lambda_\theta(y) = \log(y)$ $(\theta = 0)$; (Zellner-Revankar)
$\Lambda_\theta(y) = \ln y + \theta y^2$; (Arcsinh)
$\Lambda_\theta(y) = \sinh^{-1}(\theta y)/\theta$. The arcsinh transform is
discussed in Johnson (1949) and more recently in Robinson (1991). The main
advantage of the arcsinh transform is that it works for $y$ taking any
value, while the Box-Cox and the Zellner-Revankar transforms are only
defined if $y$ is positive. For these transformations, the error term
cannot be normally distributed except for a few isolated parameters, and so
the Gaussian likelihood is misspecified. In fact, as Amemiya and Powell
(1981) point out, the resulting estimators (in the parametric case) are
inconsistent when only $n \to \infty$.
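
The three transformation families named above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the Zellner-Revankar exponent follows the formula as read from the text, and returning $y$ itself for the arcsinh family at $\theta = 0$ (its limit as $\theta \to 0$) is a convention chosen here, not one stated in the paper.

```python
import numpy as np


def box_cox(y, theta):
    """Box-Cox transform for y > 0; theta = 0 gives the logarithm."""
    y = np.asarray(y, dtype=float)
    if theta == 0.0:
        return np.log(y)
    return (y ** theta - 1.0) / theta


def box_cox_dy(y, theta):
    """Derivative of the Box-Cox transform in y: y**(theta - 1)."""
    return np.asarray(y, dtype=float) ** (theta - 1.0)


def zellner_revankar(y, theta):
    """ln(y) + theta * y**2 for y > 0 (exponent as read from the text)."""
    y = np.asarray(y, dtype=float)
    return np.log(y) + theta * y ** 2


def arcsinh_transform(y, theta):
    """asinh(theta * y) / theta, defined for any real y."""
    y = np.asarray(y, dtype=float)
    if theta == 0.0:
        return y  # limit of asinh(theta * y) / theta as theta -> 0
    return np.arcsinh(theta * y) / theta
```

Only the arcsinh family accepts negative responses; the other two require a positive `y`, matching the remark in the text.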

We let $\Theta$ denote a finite dimensional parameter set (a compact subset of
$\mathbb{R}^k$) and $\mathcal{M}$ an infinite dimensional parameter set. We assume that $\mathcal{M}$ is a
vector space of functions endowed with metric $\|\cdot\|_{\mathcal{M}} = \|\cdot\|_\infty$.

We denote $\theta_0 \in \Theta$ and $m_0 \in \mathcal{M}$ as the true
unknown finite and infinite dimensional parameters. Define the regression
function

$m_\theta(x) = E[\Lambda_\theta(Y) \mid X = x]$

for each $\theta \in \Theta$. Note that $m_{\theta_0}(\cdot) = m_0(\cdot)$.

We suppose that we have a randomly drawn sample $Z_i = (X_i, Y_i)$,
$i = 1, \ldots, n$, from model (6). Define, for $\theta \in \Theta$ and
$m \in \mathcal{M}$,

$\varepsilon(\theta, m) = \Lambda_\theta(Y) - m(X),$

and let $\varepsilon_\theta = \varepsilon(\theta) = \varepsilon(\theta, m_\theta)$
and $\varepsilon_0 = \varepsilon_{\theta_0}$. When there is no ambiguity, we
also use the notation $\varepsilon$ and $m$ to indicate $\varepsilon_0$ and
$m_0$. Moreover, let $\Lambda_0 = \Lambda_{\theta_0}$.

In the sequel we will denote by $\hat m_\theta$ any estimator of $m_\theta$
under either the additive or the multiplicative model. In the simulation
section we will focus on the additive model and the smooth backfitting
estimator, denoted by $\hat m_\theta^{BF}(\cdot)$. See Mammen, Linton and
Nielsen (1999) for its definition. $\hat m_\theta^{BF}$ consistently
estimates a function $m_\theta^{BF}(\cdot)$, where
$m_{\theta_0}^{BF}(\cdot) = m_{\theta_0}(\cdot)$, but
$m_\theta^{BF}(\cdot) \neq m_\theta(\cdot)$ for $\theta \neq \theta_0$.

3.1. The profile likelihood (PL) estimator. The method of profile likeli- 
hood has already been applied to many different semiparametric estimation 
problems. The basic idea is simply to replace all unknown expressions of 
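the likelihood function by their nonparametric (kernel) estimates. We
consider $\Lambda_\theta(Y) = m_\theta(X) + \varepsilon_\theta$ for any
$\theta \in \Theta$. Then, the cumulative distribution function is

$\Pr[Y \le y \mid X] = \Pr[\Lambda_\theta(Y) \le \Lambda_\theta(y) \mid X]$
$\qquad = \Pr[\varepsilon_\theta \le \Lambda_\theta(y) - m_\theta(X) \mid X]$
$\qquad = F_{\varepsilon(\theta)}(\Lambda_\theta(y) - m_\theta(X)),$

where $F_{\varepsilon(\theta)}(e) = F_{\varepsilon(\theta, m_\theta)}(e)$ and
$F_{\varepsilon(\theta, m)}(e) = P(\varepsilon(\theta, m) \le e)$, and so

$f_{Y|X}(y \mid x) = f_{\varepsilon(\theta)}(\Lambda_\theta(y) - m_\theta(x)) \Lambda'_\theta(y),$

where $f_{\varepsilon(\theta)}$ and $f_{Y|X}$ are the probability density
functions of $\varepsilon(\theta)$ and of $Y$ given $X$. Then, the log
likelihood function is

$\sum_{i=1}^n \{\log f_{\varepsilon(\theta)}(\Lambda_\theta(Y_i) - m_\theta(X_i)) + \log \Lambda'_\theta(Y_i)\}.$

Let

(7)    $\hat f_{\varepsilon(\theta)}(e) := \frac{1}{ng} \sum_{i=1}^n K_2\left(\frac{e - \hat\varepsilon_i(\theta)}{g}\right),$

with $\hat\varepsilon_i(\theta) = \varepsilon_i(\theta, \hat m_\theta)$ and
$\varepsilon_i(\theta, m) = \Lambda_\theta(Y_i) - m(X_i)$. Here, $K_2$ is a
scalar kernel and $g$ is a bandwidth sequence. Then, define the profile
likelihood estimator of $\theta_0$ by

(8)    $\hat\theta_{PL} = \arg\max_{\theta \in \Theta} \sum_{i=1}^n [\log \hat f_{\varepsilon(\theta)}(\Lambda_\theta(Y_i) - \hat m_\theta(X_i)) + \log \Lambda'_\theta(Y_i)].$

The computation of $\hat\theta_{PL}$ can be done by grid search in the scalar case and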
using derivative-based algorithms in higher dimensions, assuming that the 
kernels are suitably smooth. 
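
The grid-search computation of the profile likelihood estimator can be sketched as below. This is a simplified illustration under several assumptions that are not taken from the paper: a single scalar covariate, a Nadaraya-Watson smoother standing in for the smooth backfitting or marginal integration estimators, a Gaussian kernel with a rule-of-thumb bandwidth for the residual density, and the Box-Cox family. All function names are hypothetical.

```python
import numpy as np


def box_cox(y, theta):
    """Box-Cox transform; the theta = 0 case is the logarithm."""
    return np.log(y) if theta == 0.0 else (y ** theta - 1.0) / theta


def nw_fit(x, z, h):
    """Nadaraya-Watson fit of z on scalar x at the sample points (quartic kernel)."""
    u = (x[:, None] - x[None, :]) / h
    w = np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u ** 2) ** 2, 0.0)
    return (w @ z) / w.sum(axis=1)  # row sums are positive: each point weights itself


def profile_loglik(theta, x, y, h):
    """Estimated profile log-likelihood for one value of theta."""
    lam = box_cox(y, theta)
    resid = lam - nw_fit(x, lam, h)          # estimated residuals for this theta
    n = resid.size
    g = 1.06 * resid.std(ddof=1) * n ** (-0.2)   # rule-of-thumb density bandwidth
    d = (resid[:, None] - resid[None, :]) / g
    # Gaussian-kernel density estimate evaluated at each residual
    f_hat = np.exp(-0.5 * d ** 2).sum(axis=1) / (n * g * np.sqrt(2.0 * np.pi))
    log_jacobian = (theta - 1.0) * np.log(y)     # log of d(Box-Cox)/dy
    return float(np.sum(np.log(f_hat) + log_jacobian))


def pl_estimate(x, y, grid, h):
    """Grid-search maximizer of the estimated profile log-likelihood."""
    values = [profile_loglik(t, x, y, h) for t in grid]
    return float(grid[int(np.argmax(values))])
```

Each grid point requires refitting the regression and the residual density, which is why the cost grows quickly with the grid size and with any resampling placed around the estimator.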

3.2. Mean square distance from independence (MD) estimator. There 
are four good reasons why it is worth providing alternative estimators when 
it comes to practical work. First, as we will see in Section 5, the profile 
likelihood method is computationally quite expensive. In particular, so far 
we have not found a reasonable implementation for the recentered
bootstrap. Second, that approach raises not only the usual question of
bandwidth choice for the nonparametric part $m_\theta$, but also requires a
bandwidth for the density estimation; see equation (7). Third, there are
some transformation models $\Lambda_\theta$ for which the support of $Y$
depends on the parameter $\theta$ and so are nonregular. Finally, although the estimator we get from
the profile likelihood is under certain conditions efficient in the asymptotic 
sense [Severini and Wong (1992)], this tells us little about its finite sample 
performance, neither in absolute terms nor in comparison with competitors. 

One possible and computationally attractive competitor is the
minimization of the mean square distance from independence. Why it is
computationally more attractive will be explained in Section 5. The method
we introduce here has been reviewed in Koul (2001) for other problems.

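Define, for each $\theta \in \Theta$ and $m \in \mathcal{M}$, the empirical
distribution functions

$\hat F_X(x) = \frac{1}{n} \sum_{i=1}^n 1(X_i \le x);$

$\hat F_{\varepsilon(\theta)}(e) = \frac{1}{n} \sum_{i=1}^n 1(\hat\varepsilon_i(\theta) \le e);$

$\hat F_{X,\varepsilon(\theta)}(x, e) = \frac{1}{n} \sum_{i=1}^n 1(X_i \le x) 1(\hat\varepsilon_i(\theta) \le e),$

the moment function

$G_{nMD}(\theta, \hat m_\theta)(x, e) = \hat F_{X,\varepsilon(\theta)}(x, e) - \hat F_X(x) \hat F_{\varepsilon(\theta)}(e)$

and the criterion function

(9)    $\|G_{nMD}(\theta, \hat m_\theta)\|_2^2 = \int [G_{nMD}(\theta, \hat m_\theta)(x, e)]^2 \, d\mu(x, e)$

for some probability measure $\mu$. We define an estimator of $\theta$,
denoted $\hat\theta_{MD}$, as any approximate minimizer of
$\|G_{nMD}(\theta, \hat m_\theta)\|_2^2$ over $\Theta$. To be precise, let

$\|G_{nMD}(\hat\theta_{MD}, \hat m_{\hat\theta_{MD}})\|_2 = \inf_{\theta \in \Theta} \|G_{nMD}(\theta, \hat m_\theta)\|_2 + o_p(1/\sqrt{n}).$

There are many algorithms available for computing the optimum of general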
nonsmooth functions, for example, the Nelder-Mead, and the more recent 
genetic and evolutionary algorithms. 
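
The MD criterion can be computed directly from indicator matrices. The sketch below takes the measure in (9) to be the empirical measure of the sample points, which is one possible choice of weighting, and uses a plain grid search in place of the general-purpose optimizers mentioned above. The `residuals` argument is a hypothetical user-supplied callable returning the estimated residuals for a given value of the transformation parameter.

```python
import numpy as np


def md_criterion(x, resid):
    """Mean over sample points of (joint ecdf - product of marginal ecdfs)**2."""
    x = np.asarray(x, dtype=float)
    resid = np.asarray(resid, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    # ind_x[j, i] is True when X_j <= X_i in every coordinate
    ind_x = np.all(x[:, None, :] <= x[None, :, :], axis=2)
    # ind_e[j, i] is True when resid_j <= resid_i
    ind_e = resid[:, None] <= resid[None, :]
    f_joint = np.mean(ind_x & ind_e, axis=0)   # joint ecdf at (X_i, resid_i)
    f_x = ind_x.mean(axis=0)                   # marginal ecdf of X at X_i
    f_e = ind_e.mean(axis=0)                   # marginal ecdf of residuals at resid_i
    return float(np.mean((f_joint - f_x * f_e) ** 2))


def md_estimate(x, residuals, grid):
    """Grid-search minimizer; residuals(theta) must return estimated residuals."""
    values = [md_criterion(x, residuals(t)) for t in grid]
    return float(grid[int(np.argmin(values))])
```

The criterion is a step function of the residuals, so derivative-based optimizers are not appropriate here; no density estimate and hence no second bandwidth is needed.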

We can use in (9) the empirical measure $d\mu_n$ of
$\{X_i, \hat\varepsilon_i(\theta)\}_{i=1}^n$, which results in a criterion
function

(10)    $Q_n(\theta) = \frac{1}{n} \sum_{i=1}^n [G_{nMD}(\theta, \hat m_\theta)(X_i, \hat\varepsilon_i(\theta))]^2.$

In the sequel we will denote $m_\theta$ to indicate either the function
$E[\Lambda_\theta(Y) \mid X = \cdot]$ or the function $m_\theta^{BF}$
defined above (or the population version of any other estimator of
$m_\theta$). It will be clear from the context which function it
represents.

4. Asymptotic properties. We now discuss the asymptotic properties of our
procedures. Note that although nonparametric density estimation with non-
or semiparametrically constructed variables has already been considered in
Van Keilegom and Veraverbeke (2002) and in Sperlich (2005), their results
cannot be applied directly to our problem. The first paper treats the more
complex problem of censored regression models but has no additional
parameter like our $\theta$. Nevertheless, as it considers density
estimation with nonparametrically estimated residuals, its results come
much closer to our needs than those of the second paper. Neither offers
results on derivative estimation, which we need when we translate our
estimation problem into the estimation framework of Chen, Linton and Van
Keilegom (2003) [CLV (2003) in the sequel].

To be able to apply the results of CLV (2003) for proving the asymptotics
of the profile likelihood, we need an objective function that takes its
minimum at $\theta_0$. Therefore, we introduce some notation. For any
function $\varphi$ we define $\dot\varphi := \partial\varphi/\partial\theta$
and $\ddot\varphi := \partial\dot\varphi/\partial\theta$, respectively.
Similarly, we define for any function $\varphi$:
$\varphi'(u) := \partial\varphi(u)/\partial u$ and
$\varphi''(u) := \partial\varphi'(u)/\partial u$, respectively. The same
holds for any combination of primes and dots.

We use the abbreviated notation $s = (m, r, f, g, h)$,
$s_\theta = (m_\theta, \dot m_\theta, f_{\varepsilon(\theta)}, f'_{\varepsilon(\theta)}, \dot f_{\varepsilon(\theta)})$,
$s_0 = s_{\theta_0}$ and
$\hat s_\theta = (\hat m_\theta, \hat{\dot m}_\theta, \hat f_{\varepsilon(\theta)}, \hat f'_{\varepsilon(\theta)}, \hat{\dot f}_{\varepsilon(\theta)})$.
Then, define for any $s = (m, r, f, g, h)$,

(11)    $G_{nPL}(\theta, s) = \frac{1}{n} \sum_{i=1}^n \left\{ \frac{1}{f\{\varepsilon_i(\theta, m)\}} \times [g\{\varepsilon_i(\theta, m)\}\{\dot\Lambda_\theta(Y_i) - r(X_i)\} + h\{\varepsilon_i(\theta, m)\}] + \frac{\dot\Lambda'_\theta(Y_i)}{\Lambda'_\theta(Y_i)} \right\},$

and let $G_{PL}(\theta, s) = E[G_{nPL}(\theta, s)]$, and
$\Gamma_{1PL} = \frac{\partial}{\partial\theta} G_{PL}(\theta, s_\theta)|_{\theta=\theta_0}$.

Note that $\|G_{PL}(\theta, s_\theta)\|$ and $\|G_{nPL}(\theta, \hat s_\theta)\|$
take their minimum at $\theta_0$ and $\hat\theta_{PL}$ respectively (where
$\|\cdot\|$ denotes the Euclidean norm). We assume in the Appendix that the
estimator of the nonparametric index obeys a certain asymptotic expansion.
Note that, when the index is additively separable, typical candidates are
the marginal integration estimator [Tjøstheim and Auestad (1994), Linton
and Nielsen (1995) and Sperlich, Tjøstheim and Yang (2002) for additive
interaction models] and the smooth backfitting [Mammen, Linton and Nielsen
(1999) and Nielsen and Sperlich (2005)]. Both estimators obey a certain
asymptotic expansion. The proof of such expansions can be found in Lemmas
6.1 and 6.2 of Mammen and Park (2005) for backfitting and in Linton et al.
(1997) for marginal integration. In consequence, we obtain expansions for
$\hat f_{\varepsilon(\theta)}$, $\hat f'_{\varepsilon(\theta)}$ and
$\hat{\dot f}_{\varepsilon(\theta)}$.

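Theorem 4.1. Under Assumptions A.1-A.8 given in Appendix A, we
have

$\hat\theta_{PL} - \theta_0 = -\Gamma_{1PL}^{-1} G_{nPL}(\theta_0, s_0) + o_p(n^{-1/2}),$

$\sqrt{n}(\hat\theta_{PL} - \theta_0) \Rightarrow N(0, \Omega_{PL}),$

where $\Omega_{PL} = \Gamma_{1PL}^{-1} \mathrm{Var}\{G_{1PL}(\theta_0, s_0)\} (\Gamma_{1PL}^\top)^{-1}$.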

Note that the variance of $\hat\theta_{PL}$ equals the variance of the estimator of
$\theta_0$ that is based on the true (unknown) values of the nuisance functions
$m_0$, $\dot m_0$, $f_\varepsilon$, $f'_\varepsilon$ and $\dot f_\varepsilon$. For the smooth backfitting, we expect that the profile
likelihood estimator is semiparametrically efficient following Severini and 
Wong (1992); see also Linton and Mammen (2005). 

We obtain the asymptotic distribution of $\hat\theta_{MD}$ using a
modification of Theorems 1 and 2 of CLV (2003). That result applied to the
case where the norm in (9) was finite dimensional, although their Theorem 1
is true as stated with the more general norm. Regarding their Theorem 2, we
need to modify only condition 2.5 to take account of the fact that
$G_{nMD}(\theta, \hat m_\theta)$ is a stochastic process in $(x, e)$. Let
$\lambda_\theta(y) = \dot\Lambda_\theta(y) = \partial\Lambda_\theta(y)/\partial\theta$
and let $\lambda_0 = \lambda_{\theta_0}$. We also note that

$\frac{\partial}{\partial\theta} E[\Lambda_\theta(Y) \mid X] \Big|_{\theta=\theta_0} = \int \lambda_0(\Lambda_0^{-1}(m_0(X) + e)) f_\varepsilon(e) \, de.$

Define the matrix

$\Gamma_{1MD}(x, e) = f_\varepsilon(e) E[(1(X \le x) - F_X(x))(\lambda_0(\Lambda_0^{-1}(m_0(X) + e)) + \dot m_0(X))],$
and the i.i.d. mean zero and finite variance random variables

$U_i = \int [1(X_i \le x) - F_X(x)][1(\varepsilon_i \le e) - F_\varepsilon(e)] \Gamma_{1MD}(x, e) \, d\mu(x, e)$
$\qquad + f_X(X_i) \sum_{\alpha=1}^d v_{o1\alpha}(X_{\alpha i}, \varepsilon_i) \int f_\varepsilon(e)(1(X_i \le x) - F_X(x)) \Gamma_{1MD}(x, e) \, d\mu(x, e),$

where $v_{o1\alpha}(\cdot)$ is defined in Assumption A.8 in Appendix A.

Let $V_{1MD} = E[U_i U_i^\top]$ and $\Gamma_{1MD} = \int \Gamma_{1MD}(x, e) \Gamma_{1MD}^\top(x, e) \, d\mu(x, e)$.

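Theorem 4.2. Under Assumptions B.1-B.8 given in Appendix B, we
have

$\hat\theta_{MD} - \theta_0 = -\Gamma_{1MD}^{-1} \frac{1}{n} \sum_{i=1}^n U_i + o_p(n^{-1/2}),$

$\sqrt{n}(\hat\theta_{MD} - \theta_0) \Rightarrow N(0, \Omega_{MD}),$

where $\Omega_{MD} = \Gamma_{1MD}^{-1} V_{1MD} \Gamma_{1MD}^{-1}$.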

Remarks. 1. The properties of the resulting estimators of $m$ and its
components follow from standard calculations as in Linton et al. (1997),
Theorem 3: the asymptotic distributions are as if the parameters $\theta_0$ were
known.

2. Bootstrap standard errors. CLV (2003) propose and justify the use of
the ordinary bootstrap. Let $\{Z_i^*\}_{i=1}^n$ be drawn randomly with
replacement from $\{Z_i\}_{i=1}^n$, and let

$G^*_{nMD}(\theta, m)(x, e) = \hat F^*_{X,\varepsilon(\theta)}(x, e) - \hat F^*_X(x) \hat F^*_{\varepsilon(\theta)}(e),$

where $\hat F^*_{X,\varepsilon(\theta)}$, $\hat F^*_X$ and
$\hat F^*_{\varepsilon(\theta)}$ are computed from the bootstrap data. Let
also $\hat m^*_\theta(\cdot)$ (for each $\theta$) be the same estimator as
$\hat m_\theta(\cdot)$, but based on the bootstrap data. Following Hall and
Horowitz [(1996), page 897], it is necessary to recenter the moment
condition, at least in the overidentified case. Thus, define the bootstrap
estimator $\hat\theta^*_{MD}$ to be any sequence that satisfies

(12)    $\|G^*_{nMD}(\hat\theta^*_{MD}, \hat m^*_{\hat\theta^*_{MD}}) - G_{nMD}(\hat\theta_{MD}, \hat m_{\hat\theta_{MD}})\|_2 = \inf_{\theta \in \Theta} \|G^*_{nMD}(\theta, \hat m^*_\theta) - G_{nMD}(\hat\theta_{MD}, \hat m_{\hat\theta_{MD}})\|_2 + o_{p^*}(n^{-1/2}),$

where superscript * denotes a probability or moment computed under the
bootstrap distribution conditional on the original data set
$\{Z_i\}_{i=1}^n$. The resulting bootstrap distribution of
$\sqrt{n}(\hat\theta^*_{MD} - \hat\theta_{MD})$ can be shown to be
asymptotically the same as the distribution of
$\sqrt{n}(\hat\theta_{MD} - \theta_0)$, by following the same arguments as
in the proof of Theorem B in CLV (2003). Similar arguments can be applied
to the PL method.

3. Estimated weights. Suppose that we have estimated weights $\mu_n(x, e)$
that satisfy $\sup_{x,e} |\mu_n(x, e) - \mu(x, e)| = o_p(1)$. Then the estimator computed
with the estimated weights $\mu_n(x, e)$ has the same distribution theory as the
estimator that used the limiting weights $\mu(x, e)$.

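4. Note that the asymptotic distributions in Theorems 4.1 and 4.2 do not
depend on the details of the estimator $\hat m_\theta^{BF}(x)$, only on
their population interpretations through

(13)    $m_\theta^{BF}(\cdot) = \arg\min_{m \in \mathcal{M}_{add}} \int [m_\theta(x) - m(x)]^2 f_X(x) \, dx,$

where

$\mathcal{M}_{add} = \left\{ m : m(x) = \sum_{\alpha=1}^d m_\alpha(x_\alpha) \text{ for some } m_1(\cdot), \ldots, m_d(\cdot) \right\}.$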



5. Performance in finite samples. We consider the following data
generating process:

(14)    $\Lambda_\theta(Y) = b_0 + b_1 X_1^2 + b_2 \sin(\pi X_2) + \sigma_e \varepsilon,$

where $\Lambda_\theta$ is the Box-Cox transformation,
$X_1, X_2 \sim U[-0.5, 0.5]^2$ and $\varepsilon$ is drawn from $N(0, 1)$ but
restricted to $[-3, 3]$. We study three different models with
$b_0 = 3.0\sigma_e + b_2$ and $b_1$, $b_2$, $\sigma_e$ as follows: for
model 1, we set $b_1 = 5.0$, $b_2 = 2.0$, $\sigma_e = 1.5$; for model 2,
$b_1 = 3.5$, $b_2 = 1.5$, $\sigma_e = 1.0$; and for model 3, $b_1 = 2.5$,
$b_2 = 1.0$, $\sigma_e = 0.5$. Parameter $\theta_0$ is set to 0.0, 0.5 and
1.0. Note that $\Lambda_\theta(Y)$ is by construction always positive in
our simulations.
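
A minimal sketch of how data from the simulation design (14) could be generated is given below. It is an illustration rather than the authors' code: the function names are hypothetical, and the truncated normal errors are drawn by simple rejection sampling, which is one of several ways to restrict N(0, 1) to [-3, 3].

```python
import numpy as np


def inverse_box_cox(z, theta):
    """Inverse of the Box-Cox transform (exp for theta = 0)."""
    z = np.asarray(z, dtype=float)
    if theta == 0.0:
        return np.exp(z)
    return (theta * z + 1.0) ** (1.0 / theta)


def simulate_model(n, theta0, b1, b2, sigma_e, rng):
    """Draw (X, Y) from design (14); also return the transformed response."""
    x = rng.uniform(-0.5, 0.5, size=(n, 2))
    # N(0, 1) restricted to [-3, 3]: redraw any value outside the interval
    eps = rng.standard_normal(n)
    bad = np.abs(eps) > 3.0
    while bad.any():
        eps[bad] = rng.standard_normal(int(bad.sum()))
        bad = np.abs(eps) > 3.0
    b0 = 3.0 * sigma_e + b2   # keeps the transformed response nonnegative
    lam = b0 + b1 * x[:, 0] ** 2 + b2 * np.sin(np.pi * x[:, 1]) + sigma_e * eps
    y = inverse_box_cox(lam, theta0)
    return x, y, lam
```

For example, `simulate_model(100, 0.5, 5.0, 2.0, 1.5, np.random.default_rng(1))` draws one sample of size 100 from model 1 with the transformation parameter set to 0.5.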

We estimated $\theta$ by a grid search on $[-0.5, 1.5]$ with step length 0.0625. Our
implementations for estimators of the additive index follow exactly Nielsen 
and Sperlich (2005) for the backfitting (BF), and Hengartner and Sperlich 
(2005) for the marginal integration (MI). We just show results for the BF 
method; results for marginal integration, further details and more results on 
the bootstrap can be found in Sperlich, Linton and Van Keilegom (2007). 
BF has been chosen as we know from Sperlich, Linton and Hardle (1999) 
that backfitting is more reliable when predicting the whole mean function — 
which matters more in our context — whereas MI has some advantages when 
looking at the marginal impacts. We use the local constant versions with 
quartic kernel $K(u) = \frac{15}{16}(1 - u^2)_+^2$ and bandwidth $h_1 = h_2 = n^{-1/5} h_0$ for a
large range of $h_0$-values. For the density estimator of the predicted residuals
in the PL, we use Silverman's rule of thumb bandwidth in each iteration. 
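
The kernel and bandwidth choices just described can be sketched as follows. Two points are assumptions of this sketch: the bandwidth exponent is taken to be -1/5, and `silverman_bandwidth` uses the common 0.9 min(sd, IQR/1.34) n^(-1/5) form of the rule of thumb, since the text does not say which variant was used. The function names are hypothetical.

```python
import numpy as np


def quartic_kernel(u):
    """Quartic (biweight) kernel: 15/16 * (1 - u^2)^2 on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u ** 2) ** 2, 0.0)


def regression_bandwidth(n, h0):
    """Bandwidth h = h0 * n^(-1/5) for the additive components."""
    return h0 * n ** (-1.0 / 5.0)


def silverman_bandwidth(resid):
    """Rule-of-thumb bandwidth for a density estimate of the residuals."""
    resid = np.asarray(resid, dtype=float)
    n = resid.size
    sd = resid.std(ddof=1)
    q75, q25 = np.percentile(resid, [75, 25])
    return 0.9 * min(sd, (q75 - q25) / 1.34) * n ** (-1.0 / 5.0)
```

Because the residuals change with the transformation parameter, the density bandwidth has to be recomputed at every grid point, as the text notes.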

5.1. Comparing PL with MD. We first evaluate robustness against the
bandwidth. Table 1 gives the means and standard deviations calculated for
samples of size $n = 100$ from 500 replications for each $\theta_0$ and
different bandwidths. Since the parameter set is $\Theta = [-0.5, 1.5]$,
the simulation results for $\theta_0 = 0.0$ and 1.0 are biased toward the
interior of $\Theta$. Note further that there is also an interaction
between the bandwidth and $\theta$ (the estimated as well as the true one)
concerning the smoothness of the model: using local constant smoothers, the
estimates will have more bias for larger derivatives. On the other hand,
both a smaller $\theta$ and a larger $h_0$ make the model "smoother," and
vice versa. We therefore study the bandwidth choice in a separate
simulation.

Table 1 gives the results for every combination of model, bandwidth and
method. If the error variance is small compared to the estimation error,
then the MD is expected to do worse. Indeed, even though model 3 is the
smoothest model and therefore the easiest estimation problem, for the
smallest error standard deviation ($\sigma_e = 0.5$), the MD does worse. In
those cases the PL estimator should perform better, and so it does. It
might be surprising that $\theta$ is mostly estimated better in model 1
than in model 2 and model 3, where the nonparametric functionals are much
easier to estimate. But notice that for the quality of $\hat\theta$ the
relation between estimation error and model error is more important. This
is also true for the PL method. Nevertheless, at least for small samples,
neither estimator seems to outperform the other uniformly: the PL has
mostly smaller variance, whereas the MD has mostly smaller bias. As
expected, for very small samples, the results depend on the bandwidth. For
this reason, and due to its importance in practice, we study this problem
in more detail below. We should mention that the PL method is much more
expensive to calculate than the MD.

5.2. Bandwidth choice. Perhaps the simplest approach conceptually would
be to apply plug-in bandwidths. However, this method relies on asymptotic
expressions with unknown functions and parameters that are even more
complicated to estimate. Furthermore, in simulations [see Sperlich, Linton
and Hardle (1999) or Mammen and Park (2005)] they turned out not to work
satisfactorily. Instead, we applied the cross-validation method for smooth
backfitting developed in Nielsen and Sperlich (2005), adapted to our
context.

In Table 2 we give the results for minimizing the MD over
$\theta \in \Theta$ with bandwidths chosen by cross validation. Notice that
we allow for different bandwidths for each additive component. The
simulations are done as before, but only for model 1 and based on just 100
simulation runs, which is enough to see the following: the results
presented in the table indicate that this method seems to work for any
$\theta$. We have added here the results for the case $n = 200$.
It might be surprising that the constant for the "optimal" CV bandwidths does not


Table 1 

Performance of MD and PL; means (first line), standard deviations (second line) and mean squared error (third line) of $\hat\theta$ for different
$\theta_0$, models [see (14)], and bandwidths $h_\alpha = h_0 n^{-1/5}$, $\alpha = 1, 2$, for sample size $n = 100$. All numbers are calculated from 500 replications
Table 2

Simulation results for different sample sizes $n$ with cross validation bandwidth to
minimize (10) with respect to $\theta$. Numbers are calculated from 100 replications

n    mean($\hat\theta$)    std($\hat\theta$)
0.01    0.01
0.5    0.53    0.28    0.55    0.29    0.09
1.0    0.83    0.61    0.40    1.0    0.37    0.14



only change with $\theta$, but even more with $n$ (not shown in the table). Keep in mind
that in small samples the second order terms of bias and variance are still
quite influential and, thus, the rate is to be interpreted with care; compare
with the above convergence-rate study.

A disadvantage of this cross validation procedure is that it is
computationally rather expensive, and often rather hard to implement in
practice. This is especially true if one wants to combine the cross
validation method with the PL method. Sperlich, Linton and Van Keilegom
(2007) discuss some alternative approaches, such as choosing $\theta$ and
the bandwidth simultaneously by maximizing the criterion function (8) or
minimizing the criterion function (10). The same work gives results on the
performance of the suggested bootstrap procedures, which turn out to
perform reasonably well.

5.3. Comparison with existing methods. To our knowledge, the only
existing method comparable to ours has been proposed by Linton, Chen, Wang
and Hardle (1997). They considered the criterion functions

$Q_3(\theta) = \varepsilon_\theta^\top Z W Z^\top \varepsilon_\theta$ and $Q_4(\theta) = \frac{1}{n} \sum_{i=1}^n J_\theta(Y_i) - \ln\left[\frac{1}{n} \varepsilon_\theta^\top \varepsilon_\theta\right],$

where $\varepsilon_\theta = (\varepsilon_{\theta,1}, \ldots, \varepsilon_{\theta,n})^\top$
is the vector of residuals of the transformed model using $\theta$, while
$Z = (Z_1, \ldots, Z_n)^\top$ are i.i.d. instruments with the property
$E[Z_i \varepsilon_i] = 0$. Here, $W$ is any symmetric positive definite
weighting matrix, and $J_\theta$ is the Jacobian of the transformation
$\Lambda_\theta$. When we tried to estimate $\theta$ in our simulation
model (14), both criteria always gave us -0.25 for any data generating
$\theta_0$. This was true for whichever smoother we used [in their article
they just work with the marginal integration estimator]. The problem could
come from the fact that they do not account for the change in total
variation when transforming the response variable $Y$. Therefore, we have
tried some modifications that normalize the criterion functions by the
total variation. The results then change a lot, but the methods still fail
to estimate $\theta$.

APPENDIX A: PROFILE LIKELIHOOD ESTIMATOR

To prove the asymptotic normality of the profile likelihood estimator,
we will use Theorems 1 and 2 of Chen, Linton and Van Keilegom (2003)
[abbreviated by CLV (2003) in the sequel]. Therefore, we need to define the
space to which the nuisance function $s = (m, r, f, g, h)$ belongs. We
define this space by $\mathcal{H}_{PL} = \mathcal{M}^2 \times C_1^1(\mathbb{R})^3$,
where $C_a^b(R)$ ($0 < a < \infty$, $0 < b \le 1$, $R \subset \mathbb{R}^k$
for some $k$) is the set of all continuous functions
$f : R \to \mathbb{R}$ for which

$\sup_y |f(y)| + \sup_{y, y'} \frac{|f(y) - f(y')|}{|y - y'|^b} \le a,$

and where the space $\mathcal{M}$ depends on the model at hand. For
instance, when the model is additive, a good choice for $\mathcal{M}$ is
$\mathcal{M} = \sum_{\alpha=1}^d C_1^1(R_{X_\alpha})$, and when the model
is multiplicative, $\mathcal{M} = \prod_{\alpha=1}^d C_1^1(R_{X_\alpha})$.
We also need to define, according to CLV (2003), a norm for the space
$\mathcal{H}_{PL}$. Let

$\|s\|_{PL} = \sup_{\theta \in \Theta} \max\{\|m_\theta\|_\infty, \|r_\theta\|_\infty, \|f_\theta\|_2, \|g_\theta\|_2, \|h_\theta\|_2\},$

where $\|\cdot\|_\infty$ ($\|\cdot\|_2$) denotes the $L_\infty$ ($L_2$)
norm. Finally, we denote by $\|\cdot\|$ the Euclidean norm.

We assume that the estimator $\hat m_\theta$ is constructed based on a
kernel function of degree $q_1$, which we assume to be of the form
$K_1(u_1) \times \cdots \times K_1(u_d)$, and a bandwidth $h$. The required
conditions on $K_1$, $q_1$ and $h$ are mentioned in the list of regularity
conditions given below.

A.l. Assumptions. We assume throughout this appendix that the condi- 
tions stated below are satisfied. Condition A.1-A.7 are regularity conditions 
on the kernels, bandwidths, distributions Fx, F^, etc., whereas condition A. 8 
contains primitive conditions on the estimator fhg that need to be checked 
depending on which model structure and which estimator me one has chosen. 

A.1 The probability density function $K_j$ ($j = 1, 2$) is symmetric and has compact support, $\int u^k K_j(u)\,du = 0$ for $k = 1, \ldots, q_j - 1$, $\int u^{q_j} K_j(u)\,du \ne 0$ and $K_j$ is twice continuously differentiable.
A. 2 nh — > 00, n/i^'^i — > 0, ng^ {iog g~^)~'^ 00 and ng'^'^^ — > 0, where qi and 

q2 are defined in condition A.l and (/i,(?2 > 4. 
A.3 The density $f_X$ is bounded away from zero and infinity and is Lipschitz continuous on the compact support $\mathcal{X}$.
A.4 The functions $m_\theta(x)$ and $\dot m_\theta(x)$ are $q_1$ times continuously differentiable with respect to the components of $x$ on $\mathcal{X} \times \mathcal{N}(\theta_0)$, and all derivatives up to order $q_1$ are bounded, uniformly in $(x, \theta)$ in $\mathcal{X} \times \mathcal{N}(\theta_0)$.
A.5 The transformation $\Lambda_\theta(y)$ is three times continuously differentiable in both $\theta$ and $y$, and there exists a $\delta > 0$ such that

\[ E\Bigl[ \sup_{\|\theta' - \theta\| \le \delta} \Bigl| \frac{\partial^{k+l}}{\partial y^k\, \partial\theta^l} \Lambda_{\theta'}(Y) \Bigr| \Bigr] < \infty \]

for all $\theta$ in $\Theta$ and all $0 \le k + l \le 3$.
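A.6 The distribution $F_{\varepsilon(\theta)}(y)$ is three times continuously differentiable with respect to $y$ and $\theta$, and

\[ \sup_{\theta, y} \Bigl| \frac{\partial^{k+l}}{\partial y^k\, \partial\theta^l} F_{\varepsilon(\theta)}(y) \Bigr| < \infty \]

for all $0 \le k + l \le 2$.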
A.7 For all $\eta > 0$, there exists $\epsilon(\eta) > 0$ such that

\[ \inf_{\|\theta - \theta_0\| > \eta} \|G_{PL}(\theta, s_\theta)\| \ge \epsilon(\eta) > 0. \]

Moreover, the matrix $\Gamma_{1PL}$ is of full (column) rank.
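A.8 The estimators $\hat m_0$ and $\hat{\dot m}_0$ can be written as

\[ \hat m_0(x) - m_0(x) = \frac{1}{nh} \sum_{i=1}^n \sum_{\alpha=1}^d K_1\Bigl(\frac{x_\alpha - X_{\alpha i}}{h}\Bigr) v_{01\alpha}(X_{\alpha i}, \varepsilon_i) + \frac{1}{n} \sum_{i=1}^n v_{02}(X_i, \varepsilon_i) + \hat v_0(x) \]

and

\[ \hat{\dot m}_0(x) - \dot m_0(x) = \frac{1}{nh} \sum_{i=1}^n \sum_{\alpha=1}^d K_1\Bigl(\frac{x_\alpha - X_{\alpha i}}{h}\Bigr) w_{01\alpha}(X_{\alpha i}, \varepsilon_i) + \frac{1}{n} \sum_{i=1}^n w_{02}(X_i, \varepsilon_i) + \hat w_0(x), \]

where $\sup_x |\hat v_0(x)| = o_p(n^{-1/2})$, $\sup_x |\hat w_0(x)| = o_p(n^{-1/2})$, the functions $v_{01\alpha}(x, e)$ and $w_{01\alpha}(x, e)$ are $q_1$ times continuously differentiable with respect to the components of $x$, their derivatives up to order $q_1$ are bounded, uniformly in $x$ and $e$, $E(v_{02}(X, \varepsilon)) = 0$ and $E(w_{02}(X, \varepsilon)) = 0$. Moreover, with probability tending to 1, $\hat m_\theta, \hat{\dot m}_\theta \in \mathcal{M}$, $\sup_{\theta \in \Theta} \|\hat m_\theta - m_\theta\| = o_p(1)$, $\sup_{\theta \in \Theta} \|\hat{\dot m}_\theta - \dot m_\theta\| = o_p(1)$, $\|\hat m_\theta - m_\theta\| = o_p(n^{-1/4})$ and $\|\hat{\dot m}_\theta - \dot m_\theta\| = o_p(n^{-1/4})$ uniformly over all $\theta$ with $\|\theta - \theta_0\| = o(1)$, and

\[ \sup_x |(\hat m_\theta - m_\theta)(x) - (\hat m_0 - m_0)(x)| = o_p(1)\|\theta - \theta_0\| + O_p(n^{-1/2}) \]

for all $\theta$ with $\|\theta - \theta_0\| = o(1)$. Finally, the space $\mathcal{M}$ satisfies $\int_0^\infty \sqrt{\log N(\lambda, \mathcal{M}, \|\cdot\|_\infty)}\,d\lambda < \infty$, where $N(\lambda, \mathcal{M}, \|\cdot\|_\infty)$ is the covering number with respect to the norm $\|\cdot\|_\infty$ of the class $\mathcal{M}$, that is, the minimal number of balls of $\|\cdot\|_\infty$-radius $\lambda$ needed to cover $\mathcal{M}$.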
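
The moment conditions in A.1 can be checked exactly for a polynomial kernel. The sketch below does this for the fourth-order kernel $K(u) = \tfrac{15}{32}(3 - 10u^2 + 7u^4)$ on $[-1, 1]$, which is chosen here purely as an illustration and is not taken from the paper. Any kernel with a vanishing second moment must take negative values, and this particular kernel is not twice continuously differentiable at $\pm 1$, so the example illustrates only the moment part of the condition.

```python
from numpy.polynomial import Polynomial

# Illustrative fourth-order kernel on [-1, 1] (an assumption of this example).
kernel = (15.0 / 32.0) * Polynomial([3.0, 0.0, -10.0, 0.0, 7.0])

def kernel_moment(k):
    """Exact value of the integral of u^k K(u) over [-1, 1]."""
    monomial = Polynomial([0.0] * k + [1.0])
    antiderivative = (monomial * kernel).integ()
    return antiderivative(1.0) - antiderivative(-1.0)

moments = [kernel_moment(k) for k in range(5)]
# moments[0] is 1, moments[1..3] vanish and moments[4] = -1/21 is nonzero,
# so the kernel has order q = 4 in the sense of condition A.1.
```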

A.2. Proof of Theorem 4.1. The proof consists of verifying the conditions given in Theorem 1 (regarding consistency) and Theorem 2 (regarding asymptotic normality) in CLV (2003). In Lemmas A.4-A.11 below, we verify these conditions. The result then follows immediately from those lemmas, assuming that the primitive conditions on $\hat m_\theta$ and the regularity conditions stated in A.1-A.8 hold true. Before checking the conditions of these theorems, we first need to show three preliminary Lemmas A.1-A.3, which give asymptotic expansions for the estimators $\hat f_\varepsilon$, $\hat{\dot f}_\varepsilon$ and $\hat f'_\varepsilon$. The proofs of all lemmas are deferred to Section A.3.



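Lemma A.1. For all $y \in \mathbb{R}$,

\[
\begin{aligned}
\hat f_\varepsilon(y) - f_\varepsilon(y) &= n^{-1} \sum_{i=1}^n K_{2g}(\varepsilon_i - y) - f_\varepsilon(y) \\
&\quad + f'_\varepsilon(y)\, n^{-1} \sum_{i=1}^n \Bigl[ \sum_{\alpha=1}^d v_{01\alpha}(X_{\alpha i}, \varepsilon_i) f_{X_\alpha}(X_{\alpha i}) + v_{02}(\varepsilon_i) \Bigr] + \hat r_0(y),
\end{aligned}
\]

where $\sup_y |\hat r_0(y)| = o_p(n^{-1/2})$, and where the functions $v_{01\alpha}$ and $v_{02}$ are defined in Assumption A.8. Moreover,

\[ \sup_y \sup_{\theta \in \Theta} |\hat f_{\varepsilon(\theta)}(y) - f_{\varepsilon(\theta)}(y)| = o_p(1) \]

and

\[ \sup_y \sup_{\|\theta - \theta_0\| \le \delta_n} |\hat f_{\varepsilon(\theta)}(y) - f_{\varepsilon(\theta)}(y)| = o_p(n^{-1/4}) \]

for all $\delta_n = o(1)$.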
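
The term involving $f'_\varepsilon(y)$ in Lemma A.1 reflects the fact that replacing the errors by estimated residuals shifts the argument of the density estimator. The following is a minimal sketch of that first-order effect. It assumes a Gaussian kernel for simplicity (the paper requires compact support) and a constant estimation error `delta`; both are assumptions made for this illustration only.

```python
import numpy as np

def kde(points, y, g):
    """Kernel density estimate n^{-1} sum_i K_g(points_i - y), Gaussian K."""
    u = (points[:, None] - y[None, :]) / g
    return np.exp(-0.5 * u**2).sum(axis=0) / (len(points) * g * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
eps = rng.normal(size=500)     # true errors
delta = 0.05                   # hypothetical constant error m_hat - m
resid = eps - delta            # estimated residuals
y = np.linspace(-2.0, 2.0, 41)
g = 0.4

f_hat_resid = kde(resid, y, g)
f_hat_true = kde(eps, y, g)
# First-order expansion: f_hat_resid(y) ~ f_hat_true(y) + delta * d/dy f_hat_true(y).
slope = (kde(eps, y + 1e-5, g) - kde(eps, y - 1e-5, g)) / 2e-5
approx = f_hat_true + delta * slope
max_error = np.max(np.abs(f_hat_resid - approx))
```

The remaining discrepancy `max_error` is of second order in `delta`, which is the role played by the remainder $\hat r_0(y)$ in the lemma.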

In a similar way as for Lemma A.1, we can prove the following two results. The proofs are omitted.

Lemma A.2. For all $y \in \mathbb{R}$,

n 

feiy) - My) = ingy'J2^2gi^^ - y)iMy^ - ^9iX^)) - feiy) 



i=l 



+ f'e{y)n~'Yl 



i=l 



Vola{Xai,£i)fXc,iXai)+Vo2iei) 

a=l 
d 

^Wola {Xai , e j ) fx„ {Xai ) + U^o2 (e j ) 



.a=l 



+ ro{y), 
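where $\sup_y |\hat r_0(y)| = o_p(n^{-1/2})$. Moreover,

\[ \sup_y \sup_{\theta \in \Theta} |\hat{\dot f}_{\varepsilon(\theta)}(y) - \dot f_{\varepsilon(\theta)}(y)| = o_p(1) \]

and

\[ \sup_y \sup_{\|\theta - \theta_0\| \le \delta_n} |\hat{\dot f}_{\varepsilon(\theta)}(y) - \dot f_{\varepsilon(\theta)}(y)| = o_p(n^{-1/4}) \]

for all $\delta_n = o(1)$.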

Lemma A.3. For all $y \in \mathbb{R}$,

n 



i=l 

n 



i=l 



.0=1 



+ ro{y), 

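where $\sup_y |\hat r_0(y)| = o_p(n^{-1/2})$. Moreover,

\[ \sup_y \sup_{\theta \in \Theta} |\hat f'_{\varepsilon(\theta)}(y) - f'_{\varepsilon(\theta)}(y)| = o_p(1) \]

and

\[ \sup_y \sup_{\|\theta - \theta_0\| \le \delta_n} |\hat f'_{\varepsilon(\theta)}(y) - f'_{\varepsilon(\theta)}(y)| = o_p(n^{-1/4}) \]

for all $\delta_n = o(1)$.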

Lemma A.4. Uniformly for all $\theta \in \Theta$, $G_{PL}(\theta, s)$ is continuous (with respect to the $\|\cdot\|_{PL}$-norm) in $s$ at $s = s_\theta$.



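Lemma A.5.

\[ \sup_y \sup_{\theta \in \Theta} |\hat f_{\varepsilon(\theta)}(y) - f_{\varepsilon(\theta)}(y)| = o_p(1), \]

\[ \sup_y \sup_{\theta \in \Theta} |\hat f'_{\varepsilon(\theta)}(y) - f'_{\varepsilon(\theta)}(y)| = o_p(1) \]

and

\[ \sup_y \sup_{\theta \in \Theta} |\hat{\dot f}_{\varepsilon(\theta)}(y) - \dot f_{\varepsilon(\theta)}(y)| = o_p(1). \]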

Lemma A.6. For all sequences of positive numbers $\delta_n = o(1)$,

\[ \sup_{\theta \in \Theta,\ \|s - s_\theta\|_{PL} \le \delta_n} \|G_{nPL}(\theta, s) - G_{PL}(\theta, s)\| = o_p(1). \]

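Lemma A.7. The ordinary partial derivative in $\theta$ of $G_{PL}(\theta, s_\theta)$, denoted $\Gamma_{1PL}(\theta, s_\theta)$, exists in a neighborhood of $\theta_0$, is continuous at $\theta = \theta_0$, and the matrix $\Gamma_{1PL} = \Gamma_{1PL}(\theta_0, s_0)$ is of full (column) rank.

For any $\theta \in \Theta$, we say that $G_{PL}(\theta, s)$ is pathwise differentiable at $s$ in the direction $[\bar s - s]$ if $\{s + \tau(\bar s - s) : \tau \in [0, 1]\} \subset \mathcal{H}_{PL}$ and $\lim_{\tau \to 0} [G_{PL}(\theta, s + \tau(\bar s - s)) - G_{PL}(\theta, s)]/\tau$ exists; we denote the limit by $\Gamma_{2PL}(\theta, s)[\bar s - s]$.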
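
The pathwise derivative defined above can be approximated by a difference quotient in $\tau$. The sketch below uses a toy functional $G(s) = \int_0^1 s(u)^2\,du$, which is not the functional $G_{PL}$ of the paper; it is chosen only because its derivative in the direction $\bar s - s$, namely $2\int_0^1 s(u)\{\bar s(u) - s(u)\}\,du$, is known in closed form.

```python
import numpy as np

u = np.linspace(0.0, 1.0, 2001)
du = u[1] - u[0]

def integral(values):
    """Trapezoidal rule on the fixed grid u."""
    return du * (values.sum() - 0.5 * (values[0] + values[-1]))

def G(s_vals):
    """Toy functional G(s) = int_0^1 s(u)^2 du (illustration only)."""
    return integral(s_vals**2)

s = np.sin(u)        # base point of the path
s_bar = u**2         # end point of the path s + tau * (s_bar - s)
tau = 1e-6
numeric = (G(s + tau * (s_bar - s)) - G(s)) / tau
exact = integral(2.0 * s * (s_bar - s))
```

For this quadratic functional the difference quotient differs from the closed form by exactly $\tau \int_0^1 (\bar s - s)^2\,du$, up to rounding.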

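Lemma A.8. The pathwise derivative $\Gamma_{2PL}(\theta, s_\theta)$ of $G_{PL}(\theta, s_\theta)$ exists in all directions $s - s_\theta$ and satisfies the following:

(i) $\|G_{PL}(\theta, s) - G_{PL}(\theta, s_\theta) - \Gamma_{2PL}(\theta, s_\theta)[s - s_\theta]\| \le c\|s - s_\theta\|_{PL}^2$

for all $\theta$ with $\|\theta - \theta_0\| = o(1)$, all $s$ with $\|s - s_\theta\|_{PL} = o(1)$, some constant $c < \infty$;

(ii) $\|\Gamma_{2PL}(\theta, s_\theta)[\hat s_\theta - s_\theta] - \Gamma_{2PL}(\theta_0, s_0)[\hat s_0 - s_0]\| \le c\|\theta - \theta_0\| \times o_p(1) + O_p(n^{-1/2})$

for all $\theta$ with $\|\theta - \theta_0\| = o(1)$, where $\hat s = (\hat m, \hat{\dot m}, \hat f_\varepsilon, \hat f'_\varepsilon, \hat{\dot f}_\varepsilon)$.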

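Lemma A.9. With probability tending to one, $\hat f_\varepsilon, \hat f'_\varepsilon, \hat{\dot f}_\varepsilon \in C_1^1(\mathbb{R})$. Moreover,

\[ \sup_y \sup_{\|\theta - \theta_0\| \le \delta_n} |\hat f_{\varepsilon(\theta)}(y) - f_{\varepsilon(\theta)}(y)| = o_p(n^{-1/4}), \]

\[ \sup_y \sup_{\|\theta - \theta_0\| \le \delta_n} |\hat f'_{\varepsilon(\theta)}(y) - f'_{\varepsilon(\theta)}(y)| = o_p(n^{-1/4}) \]

and

\[ \sup_y \sup_{\|\theta - \theta_0\| \le \delta_n} |\hat{\dot f}_{\varepsilon(\theta)}(y) - \dot f_{\varepsilon(\theta)}(y)| = o_p(n^{-1/4}), \]

for any $\delta_n = o(1)$.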

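Lemma A.10. For all sequences of positive numbers $\{\delta_n\}$ with $\delta_n = o(1)$,

\[ \sup_{\|\theta - \theta_0\| \le \delta_n,\ \|s - s_\theta\|_{PL} \le \delta_n} \|G_{nPL}(\theta, s) - G_{PL}(\theta, s) - G_{nPL}(\theta_0, s_0)\| = o_p(n^{-1/2}). \]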

Lemma A.11.

\[ \sqrt{n}\,\{G_{nPL}(\theta_0, s_0) + \Gamma_{2PL}(\theta_0, s_0)[\hat s - s_0]\} \Rightarrow N(0, \operatorname{Var}\{G_{1PL}(\theta_0, s_0)\}). \]

A.3. Proofs of Lemmas A.1-A.11.

Proof of Lemma A.1. Write

fe{y) - fe{y) 
1 " 

+ K2M -y)- feiy) + Op(n-i/2) 



(15) 



n 



1 



i=l 



1 



n d 



ng ^ ^ I n 



i=l 



k=la=l 



(16) 



+ - 5Z ^o2(efc) +Vo{Xi) 
" fc=i 

1 " 

+ K2g{e^ -y)- fe{y) + Op(n-V2) 

^ d n 1 " 

— X! X! ^^ola(-'fafc,efc)¥'mA: + /^(y)- X! ^o2(efc) 
a=lj,fc=l k= 

1 " 

+ - ^ J^23(ei - y) - /.(y) + Op(n-i/2)^ 



fc=i 



i=l 



where = -iii:^g(ei - y)KihiXai - X^k)- Since E{Lpnik\Xk) 

fe{y)fxA^ak) + Op{l), it follows that (16) equals 



1 



fc=i 



'^Vola{Xak,£k)fXa{Xak) + Voli^k) 



1 " 

- K2g{ei -y)- Uy) + Op{n~'/^). 

Ti . 

1 = 1 



□ 



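Proof of Lemma A.4. Note that

\[ G_{PL}(\theta, s) = E\Bigl[ \frac{1}{f\{\varepsilon(\theta, m)\}} \bigl[ g\{\varepsilon(\theta, m)\}\{\dot\Lambda_\theta(Y) - r(X)\} + h\{\varepsilon(\theta, m)\} \bigr] + \frac{\dot\Lambda'_\theta(Y)}{\Lambda'_\theta(Y)} \Bigr], \]

which is continuous in $s$ at $s = s_\theta$, provided conditions A.4-A.6 are satisfied.
□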
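
To make the profile likelihood idea concrete, the sketch below evaluates a sample profile log-likelihood of the form $\sum_i \{\log \hat f_{\varepsilon(\theta)}(\Lambda_\theta(Y_i) - \hat m_\theta(X_i)) + \log \Lambda'_\theta(Y_i)\}$ on a grid of $\theta$ values. Everything specific in it is an assumption made for illustration and not the authors' implementation: the Box-Cox family, a univariate regressor, a Nadaraya-Watson smoother with Gaussian kernel, a leave-one-out density estimate, a rule-of-thumb bandwidth and the simulated design.

```python
import numpy as np

def box_cox(y, theta):
    """Box-Cox transformation; theta = 0 corresponds to the logarithm."""
    return np.log(y) if abs(theta) < 1e-12 else (y**theta - 1.0) / theta

def nw_fit(x, z, h):
    """Nadaraya-Watson estimate of E[z | x] at the sample points."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (w @ z) / w.sum(axis=1)

def loo_density(e, g):
    """Leave-one-out Gaussian kernel density estimate at each residual."""
    k = np.exp(-0.5 * ((e[:, None] - e[None, :]) / g) ** 2)
    np.fill_diagonal(k, 0.0)
    return k.sum(axis=1) / ((len(e) - 1) * g * np.sqrt(2.0 * np.pi))

def profile_loglik(theta, x, y, h=0.1):
    """Sample profile log-likelihood for a univariate regressor."""
    lam = box_cox(y, theta)
    resid = lam - nw_fit(x, lam, h)
    g = 1.06 * resid.std() * len(resid) ** (-0.2)   # rule-of-thumb bandwidth
    log_jacobian = (theta - 1.0) * np.log(y)        # log of d/dy Box-Cox
    return float(np.sum(np.log(loo_density(resid, g)) + log_jacobian))

rng = np.random.default_rng(1)
n, theta0 = 300, 0.5
x = rng.uniform(0.0, 1.0, n)
lam0 = 4.0 + np.sin(2.0 * np.pi * x) + 0.3 * rng.normal(size=n)
y = (theta0 * lam0 + 1.0) ** (1.0 / theta0)         # inverse Box-Cox
grid = np.linspace(-0.5, 1.5, 21)
values = [profile_loglik(t, x, y) for t in grid]
theta_hat = grid[int(np.argmax(values))]
```

The Jacobian term is what keeps the criterion comparable across $\theta$: without it, transformations that shrink the scale of the response would always appear to fit better.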

Proof of Lemma A.5. This follows from Lemmas A.1-A.3. □

Proof of Lemma A.6. The proof is similar to (but easier than) that of Lemma A.10. We therefore omit the proof. □

Proof of Lemma A.7. This follows from Assumption A.7. □

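Proof of Lemma A.8. Some straightforward calculations show that

\[
\begin{aligned}
\Gamma_{2PL}(\theta, s_\theta)[\hat s_\theta - s_\theta]
&= \lim_{\tau \to 0} \frac{1}{\tau} \{G_{PL}(\theta, s_\theta + \tau(\hat s_\theta - s_\theta)) - G_{PL}(\theta, s_\theta)\} \\
&= E\Bigl[ -\Bigl\{ \frac{(\hat f_{\varepsilon(\theta)} - f_{\varepsilon(\theta)})(\varepsilon_\theta)}{f^2_{\varepsilon(\theta)}(\varepsilon_\theta)} - \frac{f'_{\varepsilon(\theta)}(\varepsilon_\theta)}{f^2_{\varepsilon(\theta)}(\varepsilon_\theta)} (\hat m_\theta - m_\theta)(X) \Bigr\} \\
&\qquad \times \bigl\{ f'_{\varepsilon(\theta)}(\varepsilon_\theta)[\dot\Lambda_\theta(Y) - \dot m_\theta(X)] + \dot f_{\varepsilon(\theta)}(\varepsilon_\theta) \bigr\} \\
&\qquad + \frac{1}{f_{\varepsilon(\theta)}(\varepsilon_\theta)} \bigl\{ -f''_{\varepsilon(\theta)}(\varepsilon_\theta)[\dot\Lambda_\theta(Y) - \dot m_\theta(X)](\hat m_\theta - m_\theta)(X) \\
&\qquad + (\hat f'_{\varepsilon(\theta)} - f'_{\varepsilon(\theta)})(\varepsilon_\theta)[\dot\Lambda_\theta(Y) - \dot m_\theta(X)] - f'_{\varepsilon(\theta)}(\varepsilon_\theta)(\hat{\dot m}_\theta - \dot m_\theta)(X) \\
&\qquad + (\hat{\dot f}_{\varepsilon(\theta)} - \dot f_{\varepsilon(\theta)})(\varepsilon_\theta) - \dot f'_{\varepsilon(\theta)}(\varepsilon_\theta)(\hat m_\theta - m_\theta)(X) \bigr\} \Bigr].
\end{aligned}
\tag{17}
\]

The first part of Lemma A.8 now follows immediately. The second part follows from the uniform consistency of $\hat m_\theta$, $\hat{\dot m}_\theta$, $\hat f_{\varepsilon(\theta)}$, $\hat f'_{\varepsilon(\theta)}$ and $\hat{\dot f}_{\varepsilon(\theta)}$, and from the fact that

\[ \sup_x |(\hat m_\theta - m_\theta)(x) - (\hat m_0 - m_0)(x)| = o_p(1)\|\theta - \theta_0\| + O_p(n^{-1/2}), \]

which follows from Assumption A.8. □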

Proof of Lemma A.9. This follows from Lemmas A.1-A.3. □

Proof of Lemma A.10. We will make use of Theorem 3 in Chen, Linton and Van Keilegom (2003). According to this result, we need to prove that

(i) 



E 



sup \gpi,{X,Y,e',s')-gpi,{X,Y,e,s) 

<»7,||s'— s||pL<»? 
for all $(\theta, s) \in \Theta \times \mathcal{H}_{PL}$, all $\eta > 0$ and for some $K > 0$.
(ii)

\[ \int_0^\infty \sqrt{\log N(\lambda, \mathcal{H}_{PL}, \|\cdot\|_{PL})}\,d\lambda < \infty. \]

Part (ii) follows from Corollary 2.7.4 in van der Vaart and Wellner (1996), together with Assumption A.8. Part (i) follows from the mean value theorem, together with the differentiability conditions imposed on the functions of which the function $g_{PL}$ is composed. □

Proof of Lemma A.11. Combining the formula of $\Gamma_{2PL}(\theta_0, s_0)$ given in (17) with the representations of $\hat f_\varepsilon$, $\hat{\dot f}_\varepsilon$ and $\hat f'_\varepsilon$ given in Lemmas A.1-A.3, we obtain after some calculations

)+r2PLi0o,So)[s-So] 
1 



n 



+ E 



1 1 



Ei - £ 



fe{e) 



(18) 



/1(e) V g 

x{f',{e)[ko{Y)-mo{X)] + fe{e)] 



fe{e) \ ng^ ^ 
1 f 1 



E^2 



+ 



fs{e) [ng'^ f^^ 

We next show that 



E^2 



9 

g 



-/^(e)|{A,(y)-m„(X)} 
(K{Yi)-mo{Xi))-U{e)\ 



(19) 

(20) E 
and 

E 

(21) 



E 



Me) 



1 f 1 " 

' E^^ 



Ei -e 



fe{e) 

{K{Yi)-mo{X,))- Ue)\ 



1 1 



/1(e) \ T^g~{ 



£i - £ 



>{fUe)[UY)-mo{X)]+Ue)} 



+ 



1 ^ 



feie) [ ng"^ ^ 



£i - e 



>{ko{Y)-mo{X)] 



0. 
It then follows that only the first term on the right-hand side of (18) [i.e., the term $G_{nPL}(\theta_0, s_0)$] is nonzero, from which the result follows. We start by showing (19):

\[ E\Bigl[ \frac{\dot f_\varepsilon(\varepsilon)}{f_\varepsilon(\varepsilon)} \Bigr] = \int \dot f_\varepsilon(y)\,dy = \frac{d}{d\theta} \int f_{\varepsilon(\theta)}(y)\,dy = 0, \]

since $\int f_{\varepsilon(\theta)}(y)\,dy = 1$. Next, consider (20). The left-hand side equals



ng- 



■Y^{ko{Yi)-mo{X,))E 



i=l 



E 



fe{e) 
fe{e) 



. n 

= — V(Ao(l^^) - mo{Xi)) / K'^{u) du = 0. 
ngfr{ J 

Finally, for (21), note that the left-hand side can be written as



1 

ng 



i=l 



1 



-K2 



Ei — e\ d 



9 



e^ - eje) 
9 



fe{e) 



ng 



EE 

i=l 



d K2i{e^-ei9))/g) 



de 



^ de 



ng 



i=l 



feie){e{9)) 
Ei - e 



9 



de = 0. 



since $\int K_2\bigl(\frac{\varepsilon_i - e}{g}\bigr)\,de = g$. This finishes the proof. □
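
The argument used for (19) rests on the fact that differentiating a density with respect to a parameter and then integrating over its argument gives zero, because the density integrates to one for every parameter value. A small numerical check of this fact follows, using a normal scale family $f_\theta(y) = \phi(y/\theta)/\theta$ as a stand-in for $f_{\varepsilon(\theta)}$; the family and the numerical settings are assumptions made for this example only.

```python
import numpy as np

def density(y, theta):
    """Normal density with mean 0 and standard deviation theta."""
    return np.exp(-0.5 * (y / theta) ** 2) / (theta * np.sqrt(2.0 * np.pi))

y = np.linspace(-12.0, 12.0, 24001)
dy = y[1] - y[0]
theta, step = 1.3, 1e-5

# Central difference in theta approximates the derivative of the density.
f_dot = (density(y, theta + step) - density(y, theta - step)) / (2.0 * step)
total_mass = np.sum(density(y, theta)) * dy     # close to 1
integral_f_dot = np.sum(f_dot) * dy             # close to 0
```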



APPENDIX B: MD ESTIMATOR 

B.1. Assumptions. We assume throughout this appendix that Assumptions B.1-B.8 given below are valid.

B.1 The probability density function $K_1$ is symmetric and has compact support, $\int u^k K_1(u)\,du = 0$ for $k = 1, \ldots, q_1 - 1$, $\int u^{q_1} K_1(u)\,du \ne 0$ and $K_1$ is twice continuously differentiable.
B.2 nh — > oo and nh'^'^^ 0, where qi is defined in condition B.l and > 4. 
B.3 The density $f_X$ is bounded away from zero and infinity and is Lipschitz continuous on the compact support $\mathcal{X}$.
B.4 The function $m_\theta(x)$ is $q_1$ times continuously differentiable with respect to the components of $x$ on $\mathcal{X} \times \mathcal{N}(\theta_0)$, and all derivatives up to order $q_1$ are bounded, uniformly in $(x, \theta)$ in $\mathcal{X} \times \mathcal{N}(\theta_0)$.
B.5 The transformation $\Lambda_\theta(y)$ is twice continuously differentiable in both $\theta$ and $y$, and there exists a $\delta > 0$ such that

\[ E\Bigl[ \sup_{\|\theta - \theta'\| \le \delta} |\dot\Lambda_{\theta'}(Y)|^k \Bigr] < \infty \]

for all $k$ and for all $\theta$ in $\Theta$.
B.6 The distribution Fi,{y) is twice continuously differentiable with respect 

to y, and sup^ < oo. 

B.7 For all $\eta > 0$, there exists $\epsilon(\eta) > 0$ such that

\[ \inf_{\|\theta - \theta_0\| > \eta} \|G_{MD}(\theta, m_\theta)\|_2 \ge \epsilon(\eta) > 0. \]

Moreover, the matrix $\Gamma_{1MD}(x, e)$ (defined in Section 4) is of full (column) rank for a set of positive $\mu$-measure $(x, e)$.
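B.8 The estimator $\hat m_0$ can be written as

\[ \hat m_0(x) - m_0(x) = \frac{1}{nh} \sum_{i=1}^n \sum_{\alpha=1}^d K_1\Bigl(\frac{x_\alpha - X_{\alpha i}}{h}\Bigr) v_{01\alpha}(X_{\alpha i}, \varepsilon_i) + \frac{1}{n} \sum_{i=1}^n v_{02}(X_i, \varepsilon_i) + \hat v_0(x), \]

where $\sup_x |\hat v_0(x)| = o_p(n^{-1/2})$, the function $v_{01\alpha}(x, e)$ is $q_1$ times continuously differentiable with respect to the components of $x$, its derivatives up to order $q_1$ are bounded, uniformly in $x$ and $e$, and $E(v_{02}(X, \varepsilon)) = 0$. Moreover, with probability tending to 1, $\hat m_\theta \in \mathcal{M}$, $\sup_{\theta \in \Theta} \|\hat m_\theta - m_\theta\| = o_p(1)$, $\|\hat m_\theta - m_\theta\| = o_p(n^{-1/4})$ uniformly over all $\theta$ with $\|\theta - \theta_0\| = o(1)$, and

\[ \sup_x |(\hat m_\theta - m_\theta)(x) - (\hat m_0 - m_0)(x)| = o_p(1)\|\theta - \theta_0\| + O_p(n^{-1/2}) \]

for all $\theta$ with $\|\theta - \theta_0\| = o(1)$. Finally, the space $\mathcal{M}$ satisfies $\int_0^\infty \sqrt{\log N(\lambda, \mathcal{M}, \|\cdot\|_\infty)}\,d\lambda < \infty$.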



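B.2. Proof of Theorem 4.2. We use a generalization of Theorems 1 (about consistency) and 2 (about asymptotic normality) of Chen, Linton and Van Keilegom (2003), henceforth CLV (2003). Below, we state the primitive conditions under which these results are valid (see Lemmas B.1-B.6). Their proofs are given in Section B.3.

Given these lemmas, we have the desired result. We only reproduce the last part of the argument because it is slightly different from CLV (2003) due to the different norm. Note that

\[
\begin{aligned}
F_{\varepsilon(\theta, m)}(e) &= \Pr[\Lambda_\theta(Y) - m(X) \le e] \\
&= \Pr[Y \le \Lambda_\theta^{-1}(m(X) + e)] \\
&= \Pr[\varepsilon \le \Lambda_0(\Lambda_\theta^{-1}(m(X) + e)) - m_0(X)] \\
&= E F_\varepsilon[\Lambda_0(\Lambda_\theta^{-1}(m(X) + e)) - m_0(X)].
\end{aligned}
\]

Likewise, $F_{X, \varepsilon(\theta, m)}$ satisfies

\[
\begin{aligned}
F_{X, \varepsilon(\theta, m)}(x, e) &= \Pr[X \le x,\ \Lambda_\theta(Y) - m(X) \le e] \\
&= \Pr[X \le x,\ \varepsilon \le \Lambda_0(\Lambda_\theta^{-1}(m(X) + e)) - m_0(X)] \\
&= E[1(X \le x) F_\varepsilon[\Lambda_0(\Lambda_\theta^{-1}(m(X) + e)) - m_0(X)]].
\end{aligned}
\]

Define

\[ G_{MD}(\theta, m)(x, e) = F_{X, \varepsilon(\theta, m)}(x, e) - F_X(x) F_{\varepsilon(\theta, m)}(e). \]

Define now the stochastic processes

\[
\begin{aligned}
L_n(x, e) &= \sqrt{n}[\hat F_{X, \varepsilon}(x, e) - F_{X, \varepsilon}(x, e)] \\
&\quad - F_X(x) \sqrt{n}[\hat F_\varepsilon(e) - F_\varepsilon(e)] - F_\varepsilon(e) \sqrt{n}[\hat F_X(x) - F_X(x)]
\end{aligned}
\]

and

\[ \mathcal{L}_n(\theta)(x, e) = L_n(x, e) + \Gamma_{1MD}(x, e)(\theta - \theta_0) + [\Gamma_{2MD}(\theta_0, m_0)(\hat m - m_0)](x, e), \]

where for any $\theta \in \Theta$ and any $m, \bar m \in \mathcal{M}$, $\Gamma_{2MD}(\theta, m)(\bar m - m)(x, e)$ is defined in the following way. We say that $G_{MD}(\theta, m)$ is pathwise differentiable at $m$ in the direction $[\bar m - m]$ at $(x, e)$ if $\{m + \tau(\bar m - m) : \tau \in [0, 1]\} \subset \mathcal{M}$ and $\lim_{\tau \to 0} [G_{MD}(\theta, m + \tau(\bar m - m))(x, e) - G_{MD}(\theta, m)(x, e)]/\tau$ exists; we denote the limit by $\Gamma_{2MD}(\theta, m)[\bar m - m](x, e)$.

A consequence of Lemmas B.1-B.6 is that

\[ \sup_{\|\theta - \theta_0\| \le \delta_n} \|G_{nMD}(\theta, \hat m_\theta) - \mathcal{L}_n(\theta)\|_2 = o_p(n^{-1/2}), \]

which means we can effectively deal with the minimizer of $\mathcal{L}_n(\theta)$, say, $\bar\theta$. Note that $\bar\theta$ has an explicit solution and, indeed,

\[
\begin{aligned}
\sqrt{n}(\bar\theta - \theta_0) &= -\Bigl[ \int \Gamma_{1MD} \Gamma_{1MD}^\top(x, e)\,d\mu(x, e) \Bigr]^{-1} \\
&\quad \times \int [L_n(x, e) + [\Gamma_{2MD}(\theta_0, m_0)(\hat m - m_0)](x, e)] \\
&\quad \times \Gamma_{1MD}(x, e)\,d\mu(x, e).
\end{aligned}
\]

Then apply Lemma B.6 below to get the desired result.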
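
A sample analogue of $G_{MD}$ replaces the distribution functions by empirical ones. The sketch below evaluates this empirical version at the sample points and averages the squared values, that is, it takes $\mu$ to be the empirical measure of the pairs $(X_i, \hat\varepsilon_i)$. This choice of $\mu$, the function name `md_criterion` and the restriction to a scalar regressor are assumptions made for illustration.

```python
import numpy as np

def md_criterion(x, resid):
    """Mean squared distance from independence of x and resid."""
    x = np.asarray(x, dtype=float)
    resid = np.asarray(resid, dtype=float)
    ind_x = x[None, :] <= x[:, None]            # ind_x[j, i] = 1(X_i <= X_j)
    ind_e = resid[None, :] <= resid[:, None]    # ind_e[j, i] = 1(e_i <= e_j)
    joint = (ind_x & ind_e).mean(axis=1)        # empirical F_{X,eps}(X_j, e_j)
    marg_x = ind_x.mean(axis=1)                 # empirical F_X(X_j)
    marg_e = ind_e.mean(axis=1)                 # empirical F_eps(e_j)
    return float(np.mean((joint - marg_x * marg_e) ** 2))

# Perfectly dependent sample versus a balanced 2 x 2 design.
dependent = md_criterion([1, 2, 3, 4], [1, 2, 3, 4])
balanced = md_criterion([1, 1, 2, 2], [1, 2, 1, 2])
```

In the MD approach one would evaluate such a criterion at the residuals $\Lambda_\theta(Y_i) - \hat m_\theta(X_i)$ for each candidate $\theta$ and minimize over $\theta$.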

Lemma B.1. Uniformly for all $\theta \in \Theta$, $G_{MD}(\theta, m)$ is continuous (with respect to the $\|\cdot\|_\infty$-norm) in $m$ at $m = m_\theta$.

Lemma B.2. For all sequences of positive numbers $\delta_n = o(1)$,

\[ \sup_{\theta \in \Theta,\ \|m - m_\theta\|_{\mathcal{M}} \le \delta_n} \|G_{nMD}(\theta, m) - G_{MD}(\theta, m)\|_2 = o_p(1). \]

Lemma B.3. For all $(x, e)$, the ordinary partial derivative in $\theta$ of $G_{MD}(\theta, m_\theta)(x, e)$, denoted $\Gamma_{1MD}(\theta, m_\theta)(x, e)$, exists in a neighborhood of $\theta_0$, is continuous at $\theta = \theta_0$, and the matrix $\Gamma_{1MD}(x, e) = \Gamma_{1MD}(\theta_0, m_0)(x, e)$ is of full (column) rank for a set of positive $\mu$-measure $(x, e)$.

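Lemma B.4. For $\mu$-all $(x, e)$, the pathwise derivative $\Gamma_{2MD}(\theta, m_\theta)(x, e)$ of $G_{MD}(\theta, m_\theta)(x, e)$ exists in all directions $m - m_\theta$ and satisfies the following:

(i) $\|G_{MD}(\theta, m) - G_{MD}(\theta, m_\theta) - \Gamma_{2MD}(\theta, m_\theta)[m - m_\theta]\|_2 \le c\|m - m_\theta\|_{\mathcal{M}}^2$

for all $\theta$ with $\|\theta - \theta_0\| = o(1)$, all $m$ with $\|m - m_\theta\|_{\mathcal{M}} = o(1)$, some constant $c < \infty$;

(ii) $\|\Gamma_{2MD}(\theta, m_\theta)[\hat m_\theta - m_\theta] - \Gamma_{2MD}(\theta_0, m_0)[\hat m_0 - m_0]\|_2 \le c\|\theta - \theta_0\| \times o_p(1) + O_p(n^{-1/2})$

for all $\theta$ with $\|\theta - \theta_0\| = o(1)$.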

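Lemma B.5. For all sequences of positive numbers $\{\delta_n\}$ with $\delta_n = o(1)$,

\[ \sup_{\|\theta - \theta_0\| \le \delta_n,\ \|m - m_\theta\|_{\mathcal{M}} \le \delta_n} \|G_{nMD}(\theta, m) - G_{MD}(\theta, m) - G_{nMD}(\theta_0, m_0)\|_2 = o_p(n^{-1/2}). \]
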
Lemma B.6.

\[ \sqrt{n} \int \{G_{nMD}(\theta_0, m_0) + \Gamma_{2MD}(\theta_0, m_0)[\hat m - m_0]\}(x, e)\, \Gamma_{1MD}(x, e)\,d\mu(x, e) \Rightarrow N(0, V_{1MD}). \]

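B.3. Proofs of Lemmas B.1-B.6.

Proof of Lemma B.1. This follows from the representation

\[ G_{MD}(\theta, m_\theta)(x, e) = E[[1(X \le x) - F_X(x)] F_\varepsilon[\Lambda_0(\Lambda_\theta^{-1}(m_\theta(X) + e)) - m_0(X)]], \tag{22} \]

and the smoothness of $F_\varepsilon$, $\Lambda_0$ and $\Lambda_\theta^{-1}$. □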
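
Proof of Lemma B.2. Define the linearization

\[
\begin{aligned}
G^L_{nMD}(\theta, m)(x, e) &= \hat F_{X, \varepsilon(\theta, m)}(x, e) - \hat F_X(x) F_{\varepsilon(\theta, m)}(e) \\
&\quad - F_X(x) \hat F_{\varepsilon(\theta, m)}(e) + F_X(x) F_{\varepsilon(\theta, m)}(e).
\end{aligned}
\]

By the triangle inequality, we have

\[
\begin{aligned}
\sup_{\theta \in \Theta,\ \|m - m_\theta\|_{\mathcal{M}} \le \delta_n} \|G_{nMD}(\theta, m) - G_{MD}(\theta, m)\|_2
&\le \sup_{\theta \in \Theta,\ \|m - m_\theta\|_{\mathcal{M}} \le \delta_n} \|G^L_{nMD}(\theta, m) - G_{MD}(\theta, m)\|_2 \\
&\quad + \sup_{\theta \in \Theta,\ \|m - m_\theta\|_{\mathcal{M}} \le \delta_n} \|G_{nMD}(\theta, m) - G^L_{nMD}(\theta, m)\|_2.
\end{aligned}
\]

We must show that both terms on the right-hand side are $o_p(1)$. Define the stochastic processes

\[ T_{n\varepsilon}(\theta, m, e) = \hat F_{\varepsilon(\theta, m)}(e) - F_{\varepsilon(\theta, m)}(e) \]

and

\[ T_{nX\varepsilon}(\theta, m, x, e) = \hat F_{X, \varepsilon(\theta, m)}(x, e) - F_{X, \varepsilon(\theta, m)}(x, e) \]

for each $\theta \in \Theta$, $m \in \mathcal{M}$, $x \in \mathbb{R}^d$, $e \in \mathbb{R}$. We claim that

\[ \sup_{\theta \in \Theta,\ \|m - m_\theta\|_{\mathcal{M}} \le \delta_n,\ e \in \mathbb{R}} |T_{n\varepsilon}(\theta, m, e)| = o_p(1), \tag{23} \]

\[ \sup_{\theta \in \Theta,\ \|m - m_\theta\|_{\mathcal{M}} \le \delta_n,\ x \in \mathbb{R}^d,\ e \in \mathbb{R}} |T_{nX\varepsilon}(\theta, m, x, e)| = o_p(1), \tag{24} \]

which implies that

\[
\begin{aligned}
&\sup_{\theta \in \Theta,\ \|m - m_\theta\|_{\mathcal{M}} \le \delta_n} \|G^L_{nMD}(\theta, m) - G_{MD}(\theta, m)\|_2 \\
&\quad = \sup_{\theta \in \Theta,\ \|m - m_\theta\|_{\mathcal{M}} \le \delta_n} \|(\hat F_{X, \varepsilon(\theta, m)} - F_{X, \varepsilon(\theta, m)}) - F_X(\hat F_{\varepsilon(\theta, m)} - F_{\varepsilon(\theta, m)}) - F_{\varepsilon(\theta, m)}(\hat F_X - F_X)\|_2 \\
&\quad \le \sup_{\theta \in \Theta,\ \|m - m_\theta\|_{\mathcal{M}} \le \delta_n,\ x \in \mathbb{R}^d,\ e \in \mathbb{R}} |T_{nX\varepsilon}(\theta, m, x, e)| + \sup_{\theta \in \Theta,\ \|m - m_\theta\|_{\mathcal{M}} \le \delta_n,\ e \in \mathbb{R}} |T_{n\varepsilon}(\theta, m, e)| + \sup_{x \in \mathbb{R}^d} |\hat F_X(x) - F_X(x)| \\
&\quad = o_p(1).
\end{aligned}
\]

Similarly, $\sup_{\theta \in \Theta,\ \|m - m_\theta\|_{\mathcal{M}} \le \delta_n} \|G_{nMD}(\theta, m) - G^L_{nMD}(\theta, m)\|_2 = o_p(1)$. The proof of (23) and (24) is based on Theorem 3 in CLV (2003). We omit the details because they are similar to our proof of Lemma B.5. □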

Proof of Lemma B.3. Below, we calculate $\Gamma_{1MD}(x, e) = \Gamma_{1MD}(\theta_0, m_0)(x, e)$. In a similar way $\Gamma_{1MD}(\theta, m_\theta)(x, e)$ can be obtained. First, we have



d_ 

de 



E^Fe[AoiAeHrng{X) + e)) - ^^(X)]^^ 
Ue)E-^Ao{As\mg{X) + e)) 

Ue)EK{A~\mo{X)+e))-^{A^\me{X)+e)) 



feie)EK{A~\moiX)+e)) 



XoiA-\mo{X)+e)) 
A'^iAoHmo{X) + e)) 
1 



+ 



A',{Ao\mo{X) + e)) 
= fe{e)E[\o{A-\mo{X) + e)) + mo{X)] 
by the chain rule. Similarly, 
d 



mo{X) 



x,e 



fe{e)E[l{X < x){Xo{A-\mo{X) + e)) + mo{X)}]. 



Therefore, 



riMD(a;,e) = ^lMD(t'o,"^o)(x,e) = — {x,e) 



(25) 



de 

fe{e)E[{l{X<x)-Fx{x)) 

X {\o{A-\mo{X) + e)) + mo{X))]. 



□ 
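
The chain-rule step above requires the derivative of $\Lambda_\theta^{-1}(u)$ with respect to $\theta$. Differentiating the identity $\Lambda_\theta(\Lambda_\theta^{-1}(u)) = u$ in $\theta$ gives $\partial \Lambda_\theta^{-1}(u)/\partial\theta = -\dot\Lambda_\theta(\Lambda_\theta^{-1}(u)) / \Lambda'_\theta(\Lambda_\theta^{-1}(u))$. The following check compares this formula with a finite difference for the Box-Cox family; the family and the evaluation point are assumptions of this illustration.

```python
import numpy as np

def lam(y, theta):
    """Box-Cox transformation (theta != 0)."""
    return (y**theta - 1.0) / theta

def lam_inv(u, theta):
    """Inverse Box-Cox transformation (theta != 0)."""
    return (theta * u + 1.0) ** (1.0 / theta)

def lam_dot(y, theta):
    """Derivative of the Box-Cox transformation with respect to theta."""
    return (theta * y**theta * np.log(y) - (y**theta - 1.0)) / theta**2

def lam_prime(y, theta):
    """Derivative of the Box-Cox transformation with respect to y."""
    return y ** (theta - 1.0)

theta, u, step = 0.5, 1.0, 1e-6
y = lam_inv(u, theta)
numeric = (lam_inv(u, theta + step) - lam_inv(u, theta - step)) / (2.0 * step)
formula = -lam_dot(y, theta) / lam_prime(y, theta)
```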



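Proof of Lemma B.4. By the law of iterated expectation and partial differentiation, we obtain that

\[
\begin{aligned}
[\Gamma_{2MD}(\theta_0, m_0)(m - m_0)](x, e)
&= \frac{\partial G_{MD}(\theta_0, m_0 + t(m - m_0))}{\partial t}(x, e) \Big|_{t=0} \\
&= f_\varepsilon(e) E[(1(X \le x) - F_X(x))(m(X) - m_0(X))].
\end{aligned}
\]

Similarly, the formula of $[\Gamma_{2MD}(\theta, m_\theta)(m - m_\theta)](x, e)$ is given by

\[
\begin{aligned}
[\Gamma_{2MD}(\theta, m_\theta)(m - m_\theta)](x, e)
&= \lim_{\tau \to 0} \frac{1}{\tau} E\bigl[ \{1(X \le x) - F_X(x)\} f_\varepsilon[\Lambda_0\{\Lambda_\theta^{-1}(m_\theta(X) + e)\} - m_0(X)] \\
&\qquad \times [\Lambda_0\{\Lambda_\theta^{-1}(m_\theta(X) + \tau(m - m_\theta)(X) + e)\} \\
&\qquad - \Lambda_0\{\Lambda_\theta^{-1}(m_\theta(X) + e)\}] \bigr].
\end{aligned}
\]

The two inequalities in the statement of Lemma B.4 now follow easily, using the consistency of $\hat m_\theta$ and the fact that $\sup_x |(\hat m_\theta - m_\theta)(x) - (\hat m_0 - m_0)(x)| = o_p(1)\|\theta - \theta_0\| + O_p(n^{-1/2})$. □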

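Proof of Lemma B.5. Define the stochastic processes

\[ \nu_{n\varepsilon}(\theta, m, e) = \sqrt{n}[\hat F_{\varepsilon(\theta, m)}(e) - F_{\varepsilon(\theta, m)}(e)] \]

and

\[ \nu_{nX\varepsilon}(\theta, m, x, e) = \sqrt{n}[\hat F_{X, \varepsilon(\theta, m)}(x, e) - F_{X, \varepsilon(\theta, m)}(x, e)] \]

for each $\theta$ with $\|\theta - \theta_0\| \le \delta_n$ and $m$ with $\|m - m_\theta\|_{\mathcal{M}} \le \delta_n$, $x \in \mathbb{R}^d$, $e \in \mathbb{R}$. We claim that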

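\[ \sup_{\|\theta - \theta_0\| \le \delta_n,\ \|m - m_\theta\|_{\mathcal{M}} \le \delta_n,\ e \in \mathbb{R}} |\nu_{n\varepsilon}(\theta, m, e)| = o_p(1), \tag{26} \]

\[ \sup_{\|\theta - \theta_0\| \le \delta_n,\ \|m - m_\theta\|_{\mathcal{M}} \le \delta_n,\ x \in \mathbb{R}^d,\ e \in \mathbb{R}} |\nu_{nX\varepsilon}(\theta, m, x, e)| = o_p(1). \tag{27} \]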

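The proof of these results is based on Theorem 3 in CLV (2003). We have to show that their condition (3.2) is satisfied, which requires in our case [with $g(Z, \theta, m) = 1(\varepsilon(\theta, m) \le e) - E1(\varepsilon(\theta, m) \le e)$ and $g(Z, \theta, m) = 1(X \le x)1(\varepsilon(\theta, m) \le e) - E1(X \le x)1(\varepsilon(\theta, m) \le e)$] that

\[ \Bigl( E\Bigl[ \sup_{(\theta', m'):\ \|\theta' - \theta\| \le \delta,\ \|m' - m\|_{\mathcal{M}} \le \delta} |g(Z, \theta', m') - g(Z, \theta, m)|^r \Bigr] \Bigr)^{1/r} \le K\delta^s \]

for all $(\theta, m) \in \Theta \times \mathcal{M}$, all small positive values $\delta = o(1)$, and for some constants $s \in (0, 1]$, $K > 0$, and that the bound holds for $\mu$-almost all $(x, e)$. We have

\[
\begin{aligned}
|g(Z, \theta', m') - g(Z, \theta, m)| &\le |1(\varepsilon(\theta, m) \le e) - 1(\varepsilon(\theta', m') \le e)| \\
&\quad + |E1(\varepsilon(\theta, m) \le e) - E1(\varepsilon(\theta', m') \le e)|
\end{aligned}
\]

and

\[
\begin{aligned}
|1(\varepsilon(\theta, m) \le e) - 1(\varepsilon(\theta', m') \le e)|
&= |1(\Lambda_\theta(Y) - m(X) \le e) - 1(\Lambda_{\theta'}(Y) - m'(X) \le e)| \\
&\le |1(\Lambda_\theta(Y) - m(X) \le e) - 1(\Lambda_\theta(Y) - m'(X) \le e)| \\
&\quad + |1(\Lambda_\theta(Y) - m'(X) \le e) - 1(\Lambda_{\theta'}(Y) - m'(X) \le e)|.
\end{aligned}
\]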


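For all $m' \in \mathcal{M}$ with $\|m' - m\|_{\mathcal{M}} \le \delta < 1$, we have for all $Y$, $X$, $e$

\[
\begin{aligned}
&\sup_{\|m' - m\|_{\mathcal{M}} \le \delta} |1(m'(X) \ge \Lambda_\theta(Y) - e) - 1(m(X) \ge \Lambda_\theta(Y) - e)| \\
&\quad \le 1(m(X) + \delta \ge \Lambda_\theta(Y) - e) - 1(m(X) - \delta \ge \Lambda_\theta(Y) - e).
\end{aligned}
\]

The preceding term is either one or zero and its expectation is the probability that $m(X) + \delta \ge \Lambda_\theta(Y) - e \ge m(X) - \delta$, which is the probability that $e + \delta \ge \Lambda_\theta(Y) - m(X) \ge e - \delta$, which is

\[
\begin{aligned}
F_{\varepsilon(\theta, m)}(e + \delta) - F_{\varepsilon(\theta, m)}(e - \delta)
&= E F_\varepsilon[\Lambda_0(\Lambda_\theta^{-1}(m(X) + e + \delta)) - m_0(X)] \\
&\quad - E F_\varepsilon[\Lambda_0(\Lambda_\theta^{-1}(m(X) + e - \delta)) - m_0(X)].
\end{aligned}
\]

We then apply the smoothness conditions on $F_\varepsilon$, $\Lambda_0$ and $\Lambda_\theta^{-1}$ to bound the right-hand side by $K\delta$ for small enough $\delta$ and constant $K < \infty$.

Next, by the Mean Value Theorem, we have $\Lambda_\theta(Y) - \Lambda_{\theta'}(Y) = \dot\Lambda_{\theta^*}(Y) \times (\theta - \theta')$, where $\theta^*$ is an intermediate value between $\theta$ and $\theta'$. For all $\alpha > 0$, by the Bonferroni and Markov inequalities,

\[
\begin{aligned}
\Pr\Bigl[ \max_{1 \le i \le n} \sup_{\|\theta - \theta'\| \le \delta} |\dot\Lambda_{\theta'}(Y_i)| > c \times n^\alpha \Bigr]
&\le n \times \Pr\Bigl[ \sup_{\|\theta - \theta'\| \le \delta} |\dot\Lambda_{\theta'}(Y)| > c \times n^\alpha \Bigr] \\
&\le n \times \frac{E[\sup_{\|\theta - \theta'\| \le \delta} |\dot\Lambda_{\theta'}(Y)|^k]}{c^k n^{\alpha k}} = o(1),
\end{aligned}
\]

provided $k > \alpha^{-1}$.

Therefore, we can safely assume that there is some upper bound $c$ such that $\sup_{\|\theta - \theta'\| \le \delta} |\Lambda_\theta(Y) - \Lambda_{\theta'}(Y)| \le c \times \delta$. Therefore, on this set,

\[
\begin{aligned}
&\sup |1(\Lambda_\theta(Y) - m'(X) \le e) - 1(\Lambda_{\theta'}(Y) - m'(X) \le e)| \\
&\quad \le |1(\Lambda_\theta(Y) + c\delta - m'(X) \le e) - 1(\Lambda_\theta(Y) - c\delta - m'(X) \le e)|,
\end{aligned}
\]

which has probability bounded by $K\delta$ for some $K > 0$.

Therefore, condition (3.2) of Theorem 3 in CLV (2003) is satisfied with $r = 2$ and $s = 1/2$, and condition (3.3) of Theorem 3 is satisfied by the condition on the covering number of the class $\mathcal{M}$, stated in Assumption B.8.

□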
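
The Bonferroni and Markov step bounds the probability of a large maximum by $n\,E|Z|^k / (c\,n^\alpha)^k$, which tends to zero when $k > \alpha^{-1}$. The sketch below evaluates this bound for a standard exponential variable $Z$, for which $E Z^k = k!$; the distribution and the constants are hypothetical choices made for illustration.

```python
from math import factorial

def union_markov_bound(n, alpha, k, c=1.0):
    """Upper bound n * E[Z^k] / (c * n**alpha)**k for Z ~ Exp(1)."""
    return n * factorial(k) / (c * n**alpha) ** k

# alpha = 1/2 and k = 3 > 1/alpha: the bound equals 6 / sqrt(n).
bounds = [union_markov_bound(n, alpha=0.5, k=3) for n in (10**2, 10**4, 10**6)]
```

With $k \le \alpha^{-1}$ the same expression does not decrease in $n$, which is why the moment condition in B.5 is imposed for all $k$.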
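
Proof of Lemma B.6. We show below that

\[
\begin{aligned}
[\Gamma_{2MD}(\theta_0, m_0)(\hat m - m_0)](x, e)
&= f_\varepsilon(e) \sqrt{n} \int (1(X \le x) - F_X(x))(\hat m(X) - m_0(X)) f_X(X)\,dX \\
&= f_\varepsilon(e) \frac{1}{\sqrt{n}} \sum_{i=1}^n (1(X_i \le x) - F_X(x)) f_X(X_i) \sum_{\alpha=1}^d v_{01\alpha}(X_{\alpha i}, \varepsilon_i) + o_p(1).
\end{aligned}
\tag{28}
\]

Therefore,

\[ [L_n(x, e) + [\Gamma_{2MD}(\theta_0, m_0)(\hat m - m_0)](x, e)] = \frac{1}{\sqrt{n}} \sum_{i=1}^n U_i(x, e) + o_p(1), \]

where

\[
\begin{aligned}
U_i(x, e) &= [1(X_i \le x) 1(\varepsilon_i \le e) - F_{X, \varepsilon}(x, e)] \\
&\quad - F_X(x)[1(\varepsilon_i \le e) - F_\varepsilon(e)] \\
&\quad - F_\varepsilon(e)[1(X_i \le x) - F_X(x)] \\
&\quad + f_X(X_i) \sum_{\alpha=1}^d v_{01\alpha}(X_{\alpha i}, \varepsilon_i) f_\varepsilon(e)(1(X_i \le x) - F_X(x)),
\end{aligned}
\]

and where $E[U_i(x, e)] = 0$ for all $x, e$. Because $F_{X, \varepsilon}(x, e) = F_X(x) F_\varepsilon(e)$, we have

\[
\begin{aligned}
U_i(x, e) &= [1(X_i \le x) - F_X(x)][1(\varepsilon_i \le e) - F_\varepsilon(e)] \\
&\quad + f_X(X_i) \sum_{\alpha=1}^d v_{01\alpha}(X_{\alpha i}, \varepsilon_i) f_\varepsilon(e)(1(X_i \le x) - F_X(x)).
\end{aligned}
\]

Now integrating $U_i(x, e)$ with respect to $\Gamma_{1MD}(x, e)\,d\mu(x, e)$ gives the answer.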
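
The simplification of $U_i(x, e)$ under independence is a purely algebraic identity. The short check below verifies it numerically for arbitrary indicator values and arbitrary values of $F_X(x)$ and $F_\varepsilon(e)$; the numbers used are placeholders chosen for illustration, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
ind_x = rng.integers(0, 2, size=1000).astype(float)   # values of 1(X_i <= x)
ind_e = rng.integers(0, 2, size=1000).astype(float)   # values of 1(eps_i <= e)
F_x, F_e = 0.37, 0.62                                  # arbitrary F_X(x), F_eps(e)

# Three-term form, with F_{X,eps}(x, e) = F_X(x) F_eps(e) under independence.
three_terms = (ind_x * ind_e - F_x * F_e) - F_x * (ind_e - F_e) - F_e * (ind_x - F_x)
# Product form used in the simplified expression for U_i.
product_form = (ind_x - F_x) * (ind_e - F_e)
```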
Proof of (28): Write 



fh{X) - moiX) 

^ X! (—^-J^—^^'"ola{Xai,£i 



nh . 

1=1 a=l 



1 " 



n . , 
1=1 
Then, provided $nh^{2q_1} \to 0$,

/ [{1{X <x)- Fx{x)){m{X) - mo{X))]fx{X) dX 



n d 

In , 



1=1 a=\ 



+ 



{\{X<x)-Fx[x))-K^ 



Xn — Xn 



fx{X)dX 



1 r 

^T.''o2{e^) J [{l{X<x)-Fx{x))]fx{X)dX + Op{l) 

i=l 

n d „ 

y^J2Y.^oia{Xai,ei) I [{l{Xi + uh<x)-Fx{x))Ki{ua)] 

■ ^ 1=1 

X fxiXi + uh)du + Op{l) 



E VolaiXa^,e^)iliXi < x) - Fx(x))/x(X,) + Op(l) 

^ 



1=1 a=l 



We also have to substitute (x) ie=0o into the formula for Timd • D 